Waddling into Data Visualization: A Cool Guide to Penguins and ggplot

Palmer Penguins

Introduction to the Palmer Penguin Dataset

The Palmer Penguin dataset is a popular dataset for data analysis and visualization in R. It contains measurements of bill length, bill depth, flipper length, and body mass for three different species of penguins: Adelie, Gentoo, and Chinstrap.

The data were collected by Dr. Kristen Gorman and the Palmer Station Long Term Ecological Research (LTER) program, which is funded by the National Science Foundation (NSF). The measurements were taken during the breeding seasons of 2007 to 2009 near Palmer Station, Antarctica.

The Palmer Penguin dataset contains 344 observations and 8 variables:

species: The species of penguin (Adelie, Gentoo, or Chinstrap)
island: The island where the penguin was observed (Biscoe, Dream, or Torgersen)
bill_length_mm: The length of the penguin’s bill in millimeters
bill_depth_mm: The depth of the penguin’s bill in millimeters
flipper_length_mm: The length of the penguin’s flipper in millimeters
body_mass_g: The body mass of the penguin in grams
sex: The sex of the penguin (male or female)
year: The year the data was collected (2007, 2008, or 2009)

The Palmer Penguin dataset is a great dataset for exploring data visualization in R, as it contains a variety of numerical and categorical variables that can be used to create interesting and informative plots. In the following sections, we will explore some of the basic and advanced plots that can be created using the Palmer Penguin dataset and the ggplot2 package.

Load the Palmer Penguin dataset

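The dataset ships with the palmerpenguins package. You can load it, along with the ggplot2 package used for plotting later on, by running the following code in R:

# If needed, install the packages first:
# install.packages(c("palmerpenguins", "ggplot2"))

# Load the palmerpenguins package
library(palmerpenguins)

# Load ggplot2, which provides ggplot() and the geom_*() functions
library(ggplot2)

# Load the penguins dataset
data("penguins")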

Explore the Palmer Penguin dataset

Before we start visualizing the data, it is always a good idea to explore the dataset. We can do this by using various R functions to get an idea of the structure of the data. Here are some useful R functions:

# View the first few rows of the dataset
head(penguins)

# View the last few rows of the dataset
tail(penguins)

# Get a summary of the dataset
summary(penguins)

##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

# Get the number of rows and columns in the dataset
dim(penguins)

## [1] 344   8

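Running the above code gives us a good overview of the Palmer Penguin dataset. We can see that the dataset has 344 observations and 8 variables, and that a few values are missing: the four measurement columns each have 2 NA's, and sex has 11. We can also see the range and distribution of each variable, which can help us choose appropriate scales and axes when creating plots. ggplot2 drops rows with missing values for a plotted variable and issues a warning, so expect such warnings in the plots below.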
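The NA's rows in the summary() output above indicate missing values. As a quick check, the sketch below counts them per column with base R's is.na() and colSums(). It assumes the palmerpenguins package is installed; the names na_counts and n_complete are only illustrative.

```r
library(palmerpenguins)

# Count missing values in each column of the penguins data frame
na_counts <- colSums(is.na(penguins))
print(na_counts)

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(penguins))
print(n_complete)
```

The per-column counts should match the NA's rows printed by summary(penguins).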

Data Visualization

Numerical summaries are useful for getting a general idea of the data, but they often don’t tell the whole story. In this section, we will explore how to create data visualizations using the ggplot2 package. Data visualization is a powerful tool not only for exploring relationships in the data, but also for communicating your findings to others.

Introduction to ggplot

ggplot2 is a powerful data visualization package in R that allows you to create beautiful and informative plots with relatively few lines of code. The package is based on the “Grammar of Graphics”, a framework for thinking about how to construct visualizations from basic building blocks such as geometric shapes, aesthetics, and scales.

The basic syntax for creating a plot with ggplot is as follows:

ggplot(data = <DATA>, aes(x = <X>, y = <Y>, color = <COLOR>, shape = <SHAPE>)) +
  <GEOM_FUNCTION>()

Here, data refers to the dataset you want to plot, aes stands for aesthetics, and specifies how to map variables in the dataset to visual properties such as x and y axis, color, and shape. <GEOM_FUNCTION> refers to the type of plot you want to create, such as geom_point() for a scatterplot or geom_histogram() for a histogram.
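As a concrete illustration of the template, the sketch below fills in each placeholder with a column from the penguins data. The choice of columns is arbitrary, and storing the plot in a variable (here the illustrative name p) is optional. It assumes ggplot2 and palmerpenguins are installed.

```r
library(ggplot2)
library(palmerpenguins)

# <DATA> = penguins, <X> = bill_length_mm, <Y> = bill_depth_mm,
# <COLOR> = species, <SHAPE> = island, <GEOM_FUNCTION> = geom_point
p <- ggplot(
    data = penguins,
    aes(x = bill_length_mm, y = bill_depth_mm, color = species, shape = island)
  ) +
  geom_point()

# Printing the object draws the plot
print(p)
```

Because a plot is an ordinary R object, further layers can be added to p later with +.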

Histograms

A histogram is a type of chart that is used to display the distribution of a numerical variable. The range of the data is divided into a series of consecutive intervals (called bins), and a rectangle is drawn for each bin, side-by-side along an axis (usually the x-axis). The height of each rectangle represents the number or proportion of data points that fall within that bin, and the width of each rectangle represents the size of the interval (called the bin width).

Histograms are useful for understanding the shape of a distribution, as well as identifying outliers and clusters within the data. They can also be used to compare the distributions of different groups or variables within a dataset.

For example, a histogram of the body mass of penguins might show that the majority of penguins have a body mass between 3000 and 5000 grams, with a few outliers on either end of the distribution. By looking at the histogram, we can quickly get a sense of the overall shape and spread of the data.
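A statement like this can also be checked numerically. The minimal sketch below computes the share of penguins with a recorded body mass between 3000 and 5000 grams; the variable names are illustrative, and it assumes the palmerpenguins package is installed.

```r
library(palmerpenguins)

mass <- penguins$body_mass_g

# TRUE where the mass lies in the range, NA where the mass is missing
in_range <- mass >= 3000 & mass <= 5000

# Share of penguins (with a recorded mass) inside the range
share_in_range <- mean(in_range, na.rm = TRUE)
print(share_in_range)
```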

Create a histogram of penguin body mass

Let’s use ggplot to create some visualizations of the Palmer Penguin dataset. We will start by creating a histogram of body mass:

# Initialize the ggplot object with the penguins dataset
ggplot(data = penguins, aes(x = body_mass_g)) +
  # Create a histogram with 30 bins
  geom_histogram(bins = 30) +
  # Alternatively, you can specify the bin width. Uncomment this
  # line and comment out the line above to try it out!
  # geom_histogram(binwidth = 100) +
  labs(
    title = "Distribution of Body Mass for Palmer Penguins",
    x = "Body Mass (g)", y = "Count"
    ) # Every good plot needs a title and axis labels!

Try changing the number of bins in the histogram to see how it affects the shape of the distribution. Does this change your interpretation of the data?

There are other variables you could explore further. Try changing the x-axis variable to bill_length_mm or bill_depth_mm and see what you find. Be sure to update your axis labels accordingly!
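Histograms can also compare groups, as mentioned earlier. One possible way to do this, sketched below, is ggplot2's facet_wrap(), which is not otherwise covered in this tutorial; it draws one panel per species. The bin width of 200 g is an arbitrary choice, and p_hist is an illustrative name.

```r
library(ggplot2)
library(palmerpenguins)

# One histogram panel per species, stacked vertically on a shared x-axis
p_hist <- ggplot(data = penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200) +
  facet_wrap(~ species, ncol = 1) +
  labs(
    title = "Distribution of Body Mass by Species",
    x = "Body Mass (g)", y = "Count"
  )

print(p_hist)
```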

Box plots

Like histograms, box plots display the distribution of a numerical variable. Because they are more compact, they are especially useful for comparing the distributions of different groups within a dataset and for spotting outliers. Unlike histograms, however, they do not reveal clusters (multiple peaks) within a group.

Box plots also show specific values such as the median (the middle value), the first quartile (the value that is greater than 25% of the data), and the third quartile (the value that is greater than 75% of the data). These are represented by the middle, bottom, and top lines of the box, respectively. In ggplot2, each whisker extends to the most extreme data point that lies within 1.5 times the interquartile range (the height of the box) from the edge of the box, and any points beyond the whiskers are drawn individually as outliers.
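The quartiles that define each box can be computed directly. The sketch below uses base R's tapply() and quantile() to get the first quartile, median, and third quartile of body mass for each species. It assumes the palmerpenguins package is installed; mass_quartiles is an illustrative name.

```r
library(palmerpenguins)

# Quartiles of body mass for each species (missing values ignored)
mass_quartiles <- tapply(
  penguins$body_mass_g,
  penguins$species,
  quantile,
  probs = c(0.25, 0.5, 0.75),
  na.rm = TRUE
)

print(mass_quartiles)
```

Comparing these numbers with the box plot is a quick way to confirm how the box is read.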

Let’s create a box plot of the body mass of the penguins, grouped by species:

# Initialize the ggplot object with the penguins dataset
ggplot(data = penguins, aes(x = species, y = body_mass_g)) +
  # Create a boxplot
  geom_boxplot() +
  labs(
    title = "Body Mass of Palmer Penguins by Species",
    x = "Species", y = "Body Mass (g)"
    )

Try changing the x-axis variable to sex or island and see what you find. Be sure to update your axis labels accordingly!

Scatterplots

Histograms and boxplots are useful for understanding the distribution of a single variable, but what if we want to explore the relationship between two variables? Scatterplots are a great way to visualize the relationship between two numerical variables.

A scatterplot consists of a series of points, where each point represents a single observation in the dataset. The x-axis represents one variable, and the y-axis represents the other. Scatterplots are useful for identifying patterns, trends, clusters, and outliers in the data, as well as for visualizing the strength and direction of the relationship between the two variables.

Let’s start with a basic scatter plot that displays the relationship between the body mass and flipper length of the penguins:

Create a scatterplot of body mass and flipper length

ggplot(
    data = penguins,
    aes(x = body_mass_g, y = flipper_length_mm)
  ) +
  geom_point()

Here we specify that the x coordinate represents the body mass, while the y coordinate represents the flipper length. The geom_point() function is then used to add points to the plot, with each point representing a single observation in the dataset.

This plot shows that there is a positive relationship between body mass and flipper length: heavier penguins tend to have longer flippers. We can also try exploring other variables in the dataset, such as the bill length and depth. Try changing the x and y variables in the code above to see what you find! For reference, here is a list of the variables in the dataset:

## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
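The strength of the relationship between body mass and flipper length seen in the scatterplot can also be summarized with a correlation coefficient. The minimal sketch below uses base R's cor(), restricted to rows where both measurements are present. It assumes the palmerpenguins package is installed; r is an illustrative name.

```r
library(palmerpenguins)

# Pearson correlation, using only rows where both values are present
r <- cor(
  penguins$body_mass_g,
  penguins$flipper_length_mm,
  use = "complete.obs"
)

print(r)
```

A value close to 1 indicates a strong positive linear relationship.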

Adding additional information to the plot

So far, we have created a scatter plot that shows the relationship between body mass and flipper length for all penguins in the dataset. However, what if there are different relationships between these variables for different species of penguins? To address this issue, we can use color, size, and shape to add additional information to the plot. For example, we can use color to indicate the species of the penguins:

ggplot(
    data = penguins,
    aes(x = body_mass_g, y = flipper_length_mm, color = species)
  ) +
  geom_point()

Here we add the color = species argument to the aes() function, which tells ggplot() to use the species variable to determine the color of each point on the plot.

We can also use the size of points to indicate another variable in the dataset. For example, we can use size to indicate the bill depth of the penguins:

ggplot(
    data = penguins,
    aes(
      x = body_mass_g, y = flipper_length_mm,
      color = species, size = bill_depth_mm
    )
  ) +
  geom_point()

This code adds the size = bill_depth_mm argument to the aes() function, which tells ggplot() to use the bill_depth_mm variable to determine the size of each point on the plot. This plot shows that there is some variation in bill depth within each species, but that Gentoo penguins tend to have the smallest bill depth.
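Point sizes can be hard to compare by eye, so it may help to back up the observation about bill depth with a numerical summary. The sketch below uses base R's aggregate() to compute the mean bill depth per species. It assumes the palmerpenguins package is installed; depth_by_species is an illustrative name.

```r
library(palmerpenguins)

# Mean bill depth per species; the formula interface drops rows with NA
depth_by_species <- aggregate(
  bill_depth_mm ~ species,
  data = penguins,
  FUN = mean
)

print(depth_by_species)
```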

Finally, we can use shape to add another dimension of categorical information to the plot. For example, we can use shape to indicate the sex of the penguins:

ggplot(
    data = penguins,
    aes(
      x = body_mass_g, y = flipper_length_mm,
      color = species, size = bill_depth_mm, shape = sex
    )
  ) +
  geom_point()

This code adds the shape = sex argument to the aes() function, which tells ggplot() to use the sex variable to determine the shape of each point on the plot. Points for male and female penguins are now drawn with different shapes, with triangles representing males and circles representing females. Penguins whose sex was not recorded cannot be assigned a shape, so ggplot2 leaves them out and issues a warning. The plot shows that there is some overlap between the sexes within each species, but that male penguins tend to have larger body mass, flipper length, and bill depth than female penguins for all three species.
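The sex column contains missing values (see the NA's row in the summary() output earlier), and ggplot2 drops those points with a warning when sex is mapped to shape. One way to make this explicit, sketched below, is to filter the rows before plotting. It assumes ggplot2 and palmerpenguins are installed; penguins_sexed and p_sex are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)

# Keep only rows where sex was recorded
penguins_sexed <- subset(penguins, !is.na(sex))

p_sex <- ggplot(
    data = penguins_sexed,
    aes(
      x = body_mass_g, y = flipper_length_mm,
      color = species, size = bill_depth_mm, shape = sex
    )
  ) +
  geom_point()

print(p_sex)
```

Filtering up front keeps the number of plotted penguins visible in the code rather than in a warning message.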

Conclusion

In this tutorial, we learned how to use ggplot2 to create histograms, boxplots, and scatterplots. We also learned how to add additional information to our plots using color, size, and shape. These plots can be used to explore the distribution of a single variable, as well as the relationship between two variables.