Welcome

In this tutorial, we will explore a dataset about Star Wars characters, have fun with numbers, and learn how to answer questions with data.

The tutorial is based on datasciencebox.org.

Live long and prosper.

Star Wars data

The data format is a table with 87 rows and 14 columns. Each row corresponds to a character in Star Wars, and each column represents a variable for each character.

Recap: What counts as a variable? Which category does each of the following variables belong to? What are their data types?

The variables include

  • name: name of the character
  • height: height (cm)
  • mass: weight (kg)
  • hair_color, skin_color, eye_color: hair, skin, and eye colors
  • birth_year: year born (BBY = Before Battle of Yavin)
  • sex: the biological sex of the character, namely male, female, hermaphroditic, or none (as in the case for Droids)
  • gender: the gender role or gender identity of the character as determined by their personality or the way they were programmed (as in the case for Droids)
  • homeworld: the name of homeworld
  • species: the name of species
  • films: a list of films the character appeared in
  • vehicles: a list of vehicles the character has piloted
  • starships: a list of starships the character has piloted

Reference: dplyr.tidyverse.org
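
A quick way to check the shape of the table and the data type of each column is sketched below. It assumes the dplyr package is installed, since that is where the starwars dataset comes from; the name column_classes is only an illustrative helper.

```r
library(dplyr)  # assumption: dplyr is installed; it ships the starwars dataset

# Number of rows (characters) and columns (variables)
dim(starwars)

# Storage class of each column, e.g. character, integer, numeric, or list
column_classes <- sapply(starwars, class)
column_classes
```

The list-type columns (films, vehicles, starships) hold several values per character, so they behave differently from the single-value columns.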

library(dplyr)  # provides the starwars dataset

# Specify the number of characters you'd like to look at
head(starwars, n = _)

head() lets you take a quick look at the first part of a data-storing object. We can specify how many rows we’d like to see through the argument n.
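
To see how the argument n behaves, here is a minimal sketch on a small made-up table (toy is hypothetical and not part of the Star Wars data). It also shows tail(), the counterpart of head() for the last rows.

```r
# Made-up table for illustration only; not part of the starwars data
toy <- data.frame(id = 1:5, label = c("a", "b", "c", "d", "e"))

first_two    <- head(toy, n = 2)   # the first 2 rows
all_but_last <- head(toy, n = -1)  # a negative n drops rows from the end
last_two     <- tail(toy, n = 2)   # the last 2 rows

first_two
all_but_last
last_two
```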

We will ask and answer some questions about the characters. Feel free to make a guess or say your answer out loud before digging into the data. :)

Who is the tallest character in Star Wars?

which.max(starwars$height)

which.max returns the location, i.e., the index, of the (first) maximum of a numeric vector. Once we know the location of the tallest character, we can look up the name stored at that location.

starwars[which.max(starwars$height), "name"]
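
The sketch below illustrates two details of which.max on a small made-up vector: missing values are skipped, and when the maximum occurs more than once, only the first position is returned. The vectors heights and labels are hypothetical and not taken from the dataset.

```r
# Made-up heights (cm) for illustration only; NA marks a missing value
heights <- c(172, NA, 202, 96, 202)
labels  <- c("A", "B", "C", "D", "E")

idx <- which.max(heights)  # NA is skipped; the tie at 202 resolves to the first one
idx                        # 3

tallest <- labels[idx]     # look up the label stored at that position
tallest                    # "C"
```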

We could also look a bit more into the heights (numerical values) of all the characters in Star Wars.

summary(starwars$height)

summary reports summary statistics for a vector of numbers. Once the data points are sorted, the quartiles divide them into four parts of roughly equal size.

  • The smallest number (minimal or Min.) is the minimum of the data points.

  • The first quartile (1st Qu.) is defined as the middle number between the smallest number and the median of the data points. About \(25\%\) of the data are smaller than this number.

  • The second quartile is the Median of the data points. About \(50\%\) of the data are smaller than this number.

  • The third quartile (3rd Qu.) is the middle value between the median and the highest value of the data points. About \(75\%\) of the data are smaller than this number.

  • The largest number (maximum or Max.) is the maximum of the data points.

  • In R, missing values are represented by the symbol NA (not available). NA's tells us the number of missing values.
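
The quantities in the list above can be checked by hand on a small made-up vector, as in the following sketch. The vector x is hypothetical. Note that R's quantile() interpolates between data points by default, so on other data its quartiles may differ slightly from the "middle number" description.

```r
# Made-up numbers for illustration only; one value is missing
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA)

s <- summary(x)  # Min., 1st Qu., Median, Mean, 3rd Qu., Max., NA's
s

# The three quartiles on their own; na.rm = TRUE drops the missing value first
q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
q                # 3, 5, 7

n_missing <- sum(is.na(x))  # the count that summary() reports as NA's
n_missing        # 1
```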

hist(starwars$height)

hist computes a histogram of the given data values. A histogram is a graph that shows the distribution of numeric values.
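
To see what a histogram counts, the sketch below computes the bins without drawing them, using a small made-up vector. The values and the bin edges (140, 160, 180, 200) are assumptions chosen so the counts are easy to verify by hand.

```r
# Made-up heights (cm) for illustration only
heights <- c(150, 155, 160, 165, 170, 172, 175, 180, 185, 200)

# plot = FALSE returns the computed histogram instead of drawing it
h <- hist(heights, breaks = seq(140, 200, by = 20), plot = FALSE)

h$breaks  # bin edges: 140 160 180 200
h$counts  # values per bin: 3 5 2
```

Each value is counted in exactly one bin, so the counts add up to the number of non-missing values.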

Who has pink eyes in Star Wars?

which(starwars$eye_color == "pink")
starwars[which(starwars$eye_color == "pink"), "name"]

What is the skin color of the character with pink eyes?

starwars[which(starwars$eye_color == "pink"), "skin_color"]
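
Wrapping the comparison in which() matters when a column contains missing values. The sketch below uses a small made-up vector (eye is hypothetical) to show the difference; whether a particular starwars column contains NA is something to check rather than assume.

```r
# Made-up eye colors for illustration only; NA marks a missing value
eye <- c("blue", NA, "pink", "brown")

is_pink <- eye == "pink"       # FALSE NA TRUE FALSE: comparing with NA gives NA
pink_idx <- which(is_pink)     # 3: which() keeps only the positions that are TRUE

by_logical <- eye[is_pink]     # NA "pink": logical indexing lets the NA through
by_index   <- eye[pink_idx]    # "pink"

by_logical
by_index
```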

How many different eye colors are there among all the Star Wars characters?

unique(starwars$eye_color)

unique returns all the unique elements of a vector.

length(unique(starwars$eye_color))

length returns the length of a vector.
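
One detail worth knowing when counting distinct values: unique keeps NA as an element of its own, so a column with missing values is counted one higher. A minimal sketch on a made-up vector (species_toy is hypothetical):

```r
# Made-up values for illustration only; NA marks a missing value
species_toy <- c("Human", "Droid", NA, "Human", "Wookiee", NA)

unique(species_toy)  # "Human" "Droid" NA "Wookiee"

n_with_na    <- length(unique(species_toy))                        # 4
n_without_na <- length(unique(species_toy[!is.na(species_toy)]))   # 3

n_with_na
n_without_na
```

anyNA(x) is a quick way to check whether a vector x contains any missing values before counting.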

How many characters are there with each specific eye color?

table(starwars$eye_color)
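
table counts how often each value occurs. The sketch below, on a small made-up vector (eye is hypothetical), also shows two optional refinements: sorting the counts and including missing values, which table leaves out by default.

```r
# Made-up eye colors for illustration only; NA marks a missing value
eye <- c("blue", "brown", "blue", NA, "yellow", "blue", "brown")

counts <- table(eye)                              # blue 3, brown 2, yellow 1
counts_sorted <- sort(counts, decreasing = TRUE)  # most frequent value first
counts_with_na <- table(eye, useNA = "ifany")     # adds a count for NA

counts
counts_sorted
counts_with_na
```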

Can you figure out how many species there are in total?

length(unique(starwars$___))

Make plots

Look at the relationship between the height and weight of each character.

library(ggplot2)  # provides ggplot(), geom_point(), and the themes
ggplot(starwars) +
  geom_point(aes(x = height, y = mass, color = gender)) +
  theme_classic() + theme(legend.position = "bottom")
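
A scatter plot can only show characters for which both plotted values are known; geom_point drops rows with a missing x or y value and prints a warning. The sketch below counts such rows on a small made-up table (toy is hypothetical), which is one way to anticipate how many points will appear.

```r
# Made-up heights (cm) and masses (kg) for illustration only
toy <- data.frame(
  height = c(172, 167, NA, 202, 150),
  mass   = c(77, NA, 32, 136, 49)
)

# TRUE for rows where both height and mass are present
plottable <- complete.cases(toy[, c("height", "mass")])

n_plotted <- sum(plottable)   # 3 rows can be drawn
n_dropped <- sum(!plottable)  # 2 rows would be dropped

n_plotted
n_dropped
```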

Can you modify the plot so that the x-axis and y-axis labels include the units for height and weight?

ggplot(starwars) +
  geom_point(aes(x = height, y = mass, color = gender)) +
  xlab("___") +
  ylab("___") + 
  theme_classic() + theme(legend.position = "bottom")

Modify the plot above to look at the relationship between height and birth year of each character.

ggplot(starwars) +
  geom_point(aes(x = height, y = ___, color = gender)) +
  theme_classic() + theme(legend.position = "bottom")

Other questions

Try to come up with other questions about the characters, and see if you can figure out the answers by playing around with this rich dataset! :)

# try it out! 
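
As one possible starting point, the sketch below asks which homeworlds appear most often. It only combines functions used earlier in this tutorial (table and head) with sort, and it assumes the dplyr package is installed; the name homeworld_counts is an illustrative choice.

```r
library(dplyr)  # assumption: dplyr is installed; it ships the starwars dataset

# Count characters per homeworld (missing homeworlds are left out by table)
homeworld_counts <- sort(table(starwars$homeworld), decreasing = TRUE)

# Show the three most common homeworlds
head(homeworld_counts, n = 3)
```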
