Welcome
In this tutorial, we will explore a dataset about Starwars characters, have fun with numbers, and learn how to answer questions from data.
The tutorial is based on datasciencebox.org.
Star Wars data
The data format is a table with 87 rows and 14 columns. Each row corresponds to a character in Star Wars, and each column represents a variable for each character.
Recap: What can be counted as a variable? What does each of the following variable belong to? What are their data types?
The variables include
- name: name of the character
- height: height (cm)
- mass: weight (kg)
- hair_color, skin_color, eye_color: hair, skin, and eye colors
- birth_year: year born (BBY = Before Battle of Yavin)
- sex: the biological sex of the character, namely male, female, hermaphroditic, or none (as in the case for Droids)
- gender: the gender role or gender identity of the character as determined by their personality or the way they were programmed (as in the case for Droids)
- homeworld: the name of homeworld
- species: the name of species
- films: a list of films the character appeared in
- vehicles: a list of vehicles the character has piloted
- starships: a list of starships the character has piloted
Reference: dplyr.tidyverse.org
# Specify the number of characters you'd like to look at
head(starwars, n = _)
head()
lets you take a quick look at the first part of a
data-storing object. We can specify how many rows we’d like to see
through the argument n
.
We will ask and answer some questions about the characters. Feel free to make a guess or say out the answer before digging into the data. :)
Who is the tallest character in Star Wars?
which.max(starwars$height)
which.max
determines the location, i.e., index of the
(first) maximum of a numeric vector. Once we know the location of the
tallest chatacter, we can find their name of that specific location.
starwars[which.max(starwars$height), "name"]
We could also look a bit more into the heights (numerical values) of all the characters in Star Wars.
summary(starwars$height)
summary
gives summary statistics of many numbers. In
statistics, given many data points, we can divide the number of data
points into four parts.
The smallest number (minimal or
Min.
) is the minimum of the data points.The first quartile (
1st Qu.
) is defined as the middle number between the smallest number and the median of the data points. \(25\%\) of the data are smaller than this number.The second quartile is the
Median
of the data points. \(50\%\) of the data are smaller than this number.The third quartile (
3rd Qu.
) is the middle value between the median and the highest value of the data points. \(75\%\) of the data are smaller than this number.The largest number (maximum or
Max.
) is the maximum of the data points.In R, missing values are represented by the symbol
NA
(not available).NA's
tells us the number of missing values.
plot(hist(starwars$height))
hist
computes a histogram of the given data values. A
histogram is a graph that shows the distribution of numeric values.
Who has pink eyes in Star Wars?
which(starwars$eye_color == "pink")
starwars[which(starwars$eye_color == "pink"), "name"]
starwars[which(starwars$eye_color == "pink"), "skin_color"]
unique(starwars$eye_color)
unique
returns all the unique elements of a vector.
length(unique(starwars$eye_color))
length
returns the length of a vector.
table(starwars$eye_color)
length(unique(starwars$___))
Make plots
Look at the relationship between the height and weight of each character.
ggplot(starwars) +
geom_point(aes(x = height, y = mass, color = gender)) +
theme_classic() + theme(legend.position = "bottom")
ggplot(starwars) +
geom_point(aes(x = height, y = mass, color = gender)) +
xlab("___") +
ylab("___") +
theme_classic() + theme(legend.position = "bottom")
Modify the plot above to look at the relationship between height and birth year of each character.
ggplot(starwars) +
geom_point(aes(x = height, y = ___, color = gender)) +
theme_classic() + theme(legend.position = "bottom")
Other questions
Try to raise other questions about the characters, and see if you could figure out the answers by playing around with the rich dataset! :)
# try it out!