Mini-homework 4 – Flights of New York
Instructions
Obtain the GitHub repository you will use to complete the mini-homework, which contains a starter file named flights_of_new_york.Rmd. This mini-homework will help you become more familiar with manipulating datasets using R by guiding you through some examples. Read the About the dataset section to get some background information on the dataset that you’ll be working with. Each of the exercises below is to be completed in the spaces provided within your starter file flights_of_new_york.Rmd. Then, when you’re ready to submit, follow the directions in the How to submit section below.
About the dataset
The dataset we are working with in this mini-homework is the flights dataset that is loaded via the nycflights13 R package. This dataset contains on-time data for all flights that departed from a New York City airport (airport codes: JFK, LGA, EWR) in 2013.
Exercises
Start working on these exercises after you have successfully cloned your repository into RStudio Server.
As always, you should inspect a new dataset to become more familiar with what it contains. To do this in RStudio, type the following command in your Console window (not in the R Markdown file):
View(flights)
Since the flights dataset is loaded as an R package via library(nycflights13) in your setup code block, additional documentation is available. To read more about the dataset, type the following command in your Console window:
?flights
Use these resources to answer these initial questions.
How many rows and columns does this dataset have?
What does a single row in this dataset represent?
What is the difference between the information contained in the dep_time and sched_dep_time columns?
Which columns contain information about dates and times?
Airplanes are reused across many different flights. Which columns would be helpful to use in identifying individual airplanes?
Let’s start with the select() function. The select() function selects columns from a dataset, which is useful when you’re working with a dataset that contains dozens of variables. Try running the following code:
flights %>% select(year, month)
Based on the output, explain what happens when you run this command.
What does %>% mean?
The strange looking symbol %>% is called the pipe, and it is a handy way to pass a dataset through a chain of commands. All examples going forward will use the pipe whenever possible.
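To make the idea concrete, here is a minimal sketch of what the pipe does. It uses a small made-up data frame (toy_flights is hypothetical and is not part of nycflights13) and compares a nested call with the equivalent piped call. We assume dplyr is installed, since loading it makes %>% available.

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy_flights <- data.frame(
  month = c(1, 2, 3),
  day   = c(15, 20, 25)
)

# Without the pipe: functions are nested and read from the inside out
result_nested <- nrow(head(toy_flights, 2))

# With the pipe: the value on the left becomes the first argument
# of the function on the right, so the steps read top to bottom
result_piped <- toy_flights %>%
  head(2) %>%
  nrow()

result_nested == result_piped
# [1] TRUE
```

Both versions do the same work. The piped version simply lists the steps in the order they happen.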
There are multiple ways to specify columns with the select() command. To demonstrate this, copy the code you wrote in Exercise 2 and replace select(year, month) with select(year:day). What does the colon : do?
Sorting is a common and indispensable operation for organizing data, and the function from dplyr that allows us to sort is called arrange(). arrange() sorts columns with textual data (chr data type) into alphabetical order and sorts numerical data into numerical order. Try running the following code:
flights %>% arrange(month, day)
Based on the output, does it look like both the month and day columns were sorted? Which column was sorted first? What happens if you reverse the order in which you list the columns in arrange()?
Important!
Students often find that there is a base R function named sort() as well, leading to confusion later in the course. Using sort() the same way you use arrange() will give you errors. For the duration of the course we will always prefer to use the tidyverse version of a function, so always use arrange() and pretend sort() doesn’t exist.
By default,
arrange() will sort data in ascending order. The function desc() can be used to sort in descending order. For example, to sort by months in reverse order, we would use arrange(desc(month)). Let’s use it to answer a simple question about the dataset: what flight experienced the longest departure delay? Identify the column that gives information about flight departure delays, and then use arrange() in combination with desc() to sort the flights dataset to find the flight with the longest departure delay.
Let’s now try an example that uses the
mutate() function, which is a little more complex. mutate() lets us transform a dataset by applying the same operation to each row in the dataset and appending the results as a new column. More simply, this is how you can add, subtract, multiply, and divide two or more columns against each other. As an example, run the following command to calculate the average speed for each flight in miles per hour (distance is recorded in miles and air_time in minutes, so we divide air_time by 60 to convert it to hours):
flights %>% mutate( average_speed = distance / (air_time / 60) )
Where does the new column you just computed show up in the dataset and what is the name of this new column? What part of the code is controlling the name of the new column?
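If it helps to see mutate() on something small first, here is a minimal sketch using a made-up data frame. The names toy_delays and time_made_up are hypothetical and chosen only for illustration. It subtracts one column from another, and the subtraction is applied to every row.

```r
library(dplyr)

# Hypothetical toy data, only for illustration (delays in minutes)
toy_delays <- data.frame(
  dep_delay = c(10, -3, 25),
  arr_delay = c(4, -8, 30)
)

# The expression on the right of = is computed once per row
toy_result <- toy_delays %>%
  mutate(
    time_made_up = dep_delay - arr_delay
  )

toy_result$time_made_up
# [1]  6  5 -5
```

Try printing toy_result in your Console and compare it with toy_delays to see what changed.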
We are not limited to only one input at a time in mutate(). As long as we separate each input by a comma, we can put as many inputs as we want in the mutate() function! As an example, let’s take the dep_time column, which gives the clock time as an integer (for example, 11:00am is 1100) and separate it into an hours column and a minutes column:
flights %>% mutate( dep_time_hour = dep_time %/% 100, dep_time_minute = dep_time %% 100 )
Next, copy and paste this code into a new code block, add a second comma, and then add a third line. Have this third line use the dep_time_hour and dep_time_minute columns to compute the number of minutes since midnight. Name this third column dep_time_minutes_midnight.
What do %/% and %% mean?
The operators %/% and %% let you do modular arithmetic. The symbol %/% means integer division and %% means remainder. As a quick example, you might remember learning about improper fractions and mixed numbers in math class. \(\dfrac{9}{4}\) is an example of an improper fraction. To convert it into a mixed number, first we need to figure out how many whole times 4 goes into 9. We can easily compute this with integer division:
9 %/% 4
# [1] 2
We also need to know the remainder, or how much is left over after dividing 9 by 4.
9 %% 4
# [1] 1
So the mixed number representation of \(\dfrac{9}{4}\) is \(2 \frac{1}{4}\).
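Both operators also work on whole vectors at once, which is why they are useful inside mutate(). The sketch below uses made-up durations in seconds (total_seconds is a hypothetical name) and splits each one into whole minutes and leftover seconds.

```r
# Hypothetical durations in seconds, only for illustration
total_seconds <- c(75, 190, 3600)

whole_minutes    <- total_seconds %/% 60  # integer division
leftover_seconds <- total_seconds %% 60   # remainder

whole_minutes
# [1]  1  3 60
leftover_seconds
# [1] 15 10  0

# The two pieces always recombine into the original value
all(whole_minutes * 60 + leftover_seconds == total_seconds)
# [1] TRUE
```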
Next we consider the filter() function, which provides a rule-based way to keep a subset of rows and remove the rest. Here we just consider rules that are simple comparisons, which involve the symbols:
- >: greater than
- >=: greater than or equal to
- <: less than
- <=: less than or equal to
- !=: not equal
- ==: equal
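Each comparison produces TRUE or FALSE for every value it is applied to, and filter() keeps the rows where the result is TRUE. Here is a minimal sketch on a made-up vector (toy_arr_delay is a hypothetical name, not a column from the flights dataset):

```r
# Hypothetical arrival delays in minutes (negative means early)
toy_arr_delay <- c(-12, 0, 7, 45)

toy_arr_delay > 0
# [1] FALSE FALSE  TRUE  TRUE
toy_arr_delay <= 0
# [1]  TRUE  TRUE FALSE FALSE
toy_arr_delay == 0
# [1] FALSE  TRUE FALSE FALSE
toy_arr_delay != 0
# [1]  TRUE FALSE  TRUE  TRUE
```

Note that testing for equality uses two equals signs (==). A single = is used for naming inputs, as you saw in mutate().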
Give filter() a try by running the following two code blocks:
flights %>% filter( arr_delay < 0 )
flights %>% filter( carrier == "UA" )
After running the above code, figure out how to combine the two examples in one filter() function (hint: it resembles what you did with mutate()). This will tell you all the flights operated by United Airlines (UA) that arrived early.
It is common to want to summarize the information contained within a dataset, such as computing sums and averages, or counting how many data points belong to different groups. This is called data aggregation, as it aggregates many data points together and uses them to compute a certain quantity. We perform data aggregation in R by using the commands group_by() and summarize(), which frequently show up as a pair.
The group_by() command is applied to one or more columns, and allows you to create groups that share common values in a column of categorical data. The summarize() command can then be used to compute quantities like averages within those groups. It will help to see an example. Run the following code, which calculates the mean (average) arrival delay for each airline carrier:
flights %>% group_by(carrier) %>% summarize( average_arr_delay = mean(arr_delay, na.rm = TRUE) )
Which airline carrier had the longest arrival delays on average? Which airline carrier had the shortest arrival delays on average?
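You may be wondering what na.rm = TRUE is doing in the code above. The sketch below, built on a made-up data frame (toy_carriers and its carrier codes "AA" and "BB" are hypothetical), compares the mean computed with and without it when a group contains a missing value (NA). It also uses n() from dplyr to count the rows in each group.

```r
library(dplyr)

# Hypothetical toy data, only for illustration; NA marks a missing delay
toy_carriers <- data.frame(
  carrier   = c("AA", "AA", "BB", "BB"),
  arr_delay = c(10, 20, NA, 4)
)

toy_summary <- toy_carriers %>%
  group_by(carrier) %>%
  summarize(
    with_na    = mean(arr_delay),                # NA if any value in the group is missing
    without_na = mean(arr_delay, na.rm = TRUE),  # missing values are dropped first
    n_flights  = n()                             # number of rows in each group
  )

toy_summary
```

In this toy data, the "BB" group has a missing value, so its with_na entry is NA while its without_na entry is 4.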
How to submit
To submit your mini-homework, follow the two steps below. Your homework will be graded for credit after you’ve completed both steps!
Save, commit, and push your completed R Markdown file so that everything is synchronized to GitHub. If you do this right, then you will be able to view your completed file on the GitHub website.
Knit your R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to the Mini-homework 4 posting on Blackboard.
Cheatsheets
You are encouraged to review and keep the following cheatsheets handy while working on this mini-homework:
Credits
nycflights13 data set source: Hadley Wickham. 2018. nycflights13: Flights that Departed NYC in 2013. https://cran.r-project.org/web/packages/nycflights13/index.html
The example in Exercise 6 and the example and question of Exercise 7 are sourced from Chapter 5 of R for Data Science by Garrett Grolemund and Hadley Wickham, used under CC BY-NC-ND 3.0.