CDS 101: Introduction to Computational and Data Sciences

Instructions

Obtain the GitHub repository you will use to complete the lab, which contains a starter file named lab10.Rmd. This lab guides you through the process of building an automated pipeline to download and clean a series of files in order to build a database of average daily temperatures for you to explore. Carefully read the lab instructions and complete the exercises using the provided spaces within your starter file lab10.Rmd. Then, when you’re ready to submit, follow the directions in the How to submit section below.

Simplify your life with automation

Many problems in data science are solved using a series of repetitive and predictable tasks. Perhaps you’ve noticed this yourself as you’ve completed the labs in this course, for example when you’re asked to create data visualizations for several different columns in a dataset or you’re asked to apply the same general sequence of data transformations with slight modifications. Maybe you’ve found an interesting dataset that is split into dozens of different files that you have to download individually and then try and combine. From start to finish, there can be many times you think to yourself “this is just the same code from before with one change, why do I need to do this again?” It is when these moments and patterns show up that it is worth considering the question, “can I automate this?”

Learning how to use R to automate even a small subset of your data science work brings with it many benefits, such as:

Writing less code to complete a data science problem
Reducing the number of errors that can be introduced through copying and pasting code that you then manually edit
Making it easier to fetch updates from a data source and integrate them into your existing database

So why haven’t we been using these ideas since the beginning of the course? It is because, in practice, automating your work requires the use of computer programming concepts that are more advanced. This can get in the way of learning the basics of data science, such as visualization, data transformations, statistical analysis, and writing reports to summarize and present your work. Also, early on in the course when you are asked to complete simpler exercises, applying automation may appear (and indeed be) unnecessary, and just add to the confusion of why we should bother with it. Now that you have some experience with R and the tidyverse, it is worth stepping through a straightforward, real-world example that demonstrates how this can be used to your benefit.

For this lab, we will put together an automated pipeline to download, read, and combine the data files from the University of Dayton - Environmental Protection Agency Average Daily Temperature Archive, http://academic.udayton.edu/kissock/http/Weather/default.htm. After we assemble our pipeline to acquire our data, we will then explore our new dataset using our familiar tidyverse tools.

Let’s begin!

About the dataset

The Average Daily Temperature Archive reports the average daily temperature from January 1995 to the present day for 157 U.S. and 167 international cities and is updated on a regular basis. Source data are from the Global Summary of the Day (GSOD) database archived by the National Centers for Environmental Information (formerly the National Climatic Data Center). The average daily temperatures posted on this site are computed from 24 hourly temperature readings in the GSOD data.

As part of this lab, you will be walked through the procedure of downloading the data files and reading them into R. The data files themselves are plain text and format the data into four columns. The table below describes the four variables stored within each file:

Column	Description
1	integer between 1 – 12 (January – December) corresponding to the month
2	integer between 1 – 31 corresponding to the calendar day in a given month
3	integer between 1995 – 2018 indicating the year
4	the average daily temperature for the given day

Temperature values (column 4) of -99 indicate that the measurement for that day is unknown.

Downloading the dataset

Downloading a file from a website using R is straightforward; you just need to know the file URL.

Go to the website hosting the dataset, http://academic.udayton.edu/kissock/http/Weather/default.htm, and look for a link on the page labeled “All sites (in single 10 MB file)”. Right-click on this link and choose the option Copy link address (depending on your web browser, this option may instead be called Copy link, Copy link location, or similar). Then, go back to your R Markdown report file in RStudio Server and assign the link to a variable named allsites_zip_url, like so:
```
allsites_zip_url <- "http://dataset.url/goes/here"
```
Note that the pasted link needs to be placed in quotation marks.

Now that we have the URL, we need to specify where we want to download the file. We will use the convenient fs package to deal with filenames and directories, which is already loaded for you in the setup block of your R Markdown lab report. We will be downloading the file to the data/ folder, which should already be part of your lab report repository.

To specify the data/ folder for later use, put the following code in your R Markdown lab report and run it:
```
data_dir <- path("data")
```

Now that we have the download folder to use, we also need to give the name of the file we are downloading and tell R that we want that file to be located inside the data/ folder.

To specify the name of the file we are downloading and to have it end up inside the data/ folder, put the following code in your R Markdown report and run it:
```
allsites_zip_path <- path(data_dir, "allsites", ext = "zip")
```
In a second code block, run allsites_zip_path to confirm that you see data/allsites.zip.
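To see how path() assembles its pieces, here is a minimal sketch using made-up folder and file names (reports, summary, and so on are hypothetical and are not part of the lab repository). It only builds the path text; no files or folders are created.

```r
library(fs)

# Hypothetical folder and file names, used only for illustration
reports_dir <- path("reports")

# path() joins the pieces with "/" and appends the extension after a "."
summary_path <- path(reports_dir, "summary", ext = "csv")
summary_path

# More than two pieces can be joined in the same call
nested_path <- path("reports", "2018", "march", ext = "pdf")
nested_path
```

Building paths this way, rather than typing "reports/summary.csv" by hand, keeps the folder name in one variable so it only needs to be changed in one place.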

Now it’s time to download the file.

To download the file, put the following code in your R Markdown report and run it:

if (!file_exists(allsites_zip_path)) {
  allsites_zip_url %>%
    curl_download(destfile = allsites_zip_path)
}

When the code block has finished running, verify that it worked by going into your files browser in the lower-right window of RStudio and clicking on the data/ folder. You should see a file named allsites.zip that is approximately 13.7 MB in size.

What is going on in the code block that you just ran? The curl_download() function comes from the curl package that was also pre-loaded in your R Markdown setup block. This function does what it says: it downloads things. You are downloading the file located at allsites_zip_url and saving it to the location allsites_zip_path (destfile is short for “destination file”).

What about the part if (!file_exists(allsites_zip_path))? First, try running the code block from Exercise 4 again. You should notice that the block seems to run a lot faster this time.

Next, in a new code block below Exercise 5 in your lab report, run,
```
file_exists(allsites_zip_path)
```
Then, in another new code block, run
```
!file_exists(allsites_zip_path)
```
Based on what these code blocks are giving you, and the fact that the code block in Exercise 4 now runs faster, explain what the full line if (!file_exists(allsites_zip_path)) is doing in the code.

Unzipping the data files

We have our allsites.zip file, which is a file in the zip format. We need to “unzip” this file in order to access the many files that are contained within it.

In a code block, run the following code:
```
allsites_zip_path %>%
  unzip(exdir = data_dir)
```
Check the data/ folder again in your Files tab in the lower-right side of RStudio Server. What has happened after you ran this command?

Extracting information from a file name

There are a lot of data files contained in this zip file, so let’s just focus on one for now.

Specify the data file for Huntsville, Alabama by running the following code block:
```
data_file <- path(data_dir, "ALHUNTSV", ext = "txt")
```

If you run data_file in your Console window, you’ll get

## [1] "data/ALHUNTSV.txt"

as output. There are some convenience functions available for interacting with the path format. The path_file() function returns just the filename part of a path. For example:

example_file_path <- path("content", "datafile", ext = "txt")
example_file_path %>%
  path_file()

## [1] "datafile.txt"

If we wanted the directory part of the path, we use path_dir():

example_file_path %>%
  path_dir()

## [1] "content"

If we wanted to remove the .txt extension, we use path_ext_remove():

example_file_path %>%
  path_ext_remove()

## [1] "content/datafile"
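The fs package has a few more helpers in the same family. As a small sketch, path_ext() returns just the extension, and the path_ functions also accept several paths at once. The file names below (first, second) are hypothetical and only serve as an illustration.

```r
library(fs)

example_file_path <- path("content", "datafile", ext = "txt")

# path_ext() returns only the extension, without the leading "."
path_ext(example_file_path)

# The path_ functions work on several paths at once (hypothetical file names)
several_paths <- path("content", c("first", "second"), ext = "txt")
path_file(several_paths)
```

Working on several paths at once becomes useful later in the lab, when there are hundreds of data files instead of one.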

Apply two of the path_ functions to reduce the data_file file path from data/ALHUNTSV.txt to ALHUNTSV. Assign the result to data_file_name.

The file names tell us the locations where the temperature measurements come from. If the location is in the United States, then the first two letters are the two-letter abbreviation for a state and the remaining letters (up to six) give the city. So, for ALHUNTSV.txt, the first two letters are AL for Alabama, and the remaining six letters are HUNTSV, for Huntsville. For international locations, the first two letters are a two-letter country code and the remaining letters are still a city name; for example, AUPERTH.txt contains temperature measurements from Perth, Australia. Because we will ultimately want to use all of the files in the data/ directory, we should have a method for extracting the city name and the state/country codes directly from the filenames.

The str_sub() function from the stringr package allows us to extract text based on the location of the letters. For example, let’s consider the small string of text “CDS102”.

cds102 <- "CDS102"

The course code CDS starts on the first letter and ends on the third letter. To get this part of the text using str_sub(), I would run:

cds102 %>%
  str_sub(start = 1, end = 3)

## [1] "CDS"

The course number starts on the fourth letter and then goes to the end. Note that, if the part of the string you are grabbing goes to the end, you don’t need to specify the end = keyword. So, to get the “102” part of the text using str_sub(), I would run:

cds102 %>%
  str_sub(start = 4)

## [1] "102"
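As an aside, str_sub() also understands negative positions, which count backwards from the end of the string. The sketch below reuses the "CDS102" example, and the second course code is just an extra illustration. You don't need negative positions for this lab, but they are handy when the part you want is easier to describe from the end.

```r
library(stringr)

cds102 <- "CDS102"

# start = -3 means "begin three characters from the end"
last_three <- str_sub(cds102, start = -3)
last_three

# end = -4 means "stop at the fourth character from the end"
all_but_last_three <- str_sub(cds102, start = 1, end = -4)
all_but_last_three

# str_sub() also works on several strings at once
course_numbers <- str_sub(c("CDS101", "CDS102"), start = 4)
course_numbers
```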

Apply str_sub() to data_file_name in order to extract the first two letters AL, and assign this to the variable file_state. Then, apply str_sub() again in order to extract the remaining letters HUNTSV and assign this to the variable file_city.

Read a data file, label the dataset

The data in ALHUNTSV.txt is not structured in the conventional CSV format. If we look at the first few lines of the file,

data_file %>%
  read_lines(n_max = 5) %>%
  str_flatten(collapse = "\n")

 1             1             1995         48.8
 1             2             1995         32.1
 1             3             1995         31.2
 1             4             1995         32.5
 1             5             1995         21.1

we see that there are four columns of data separated by spaces. Because the separation of the spaces is predictable, we will want to use the read_table() function to read this file. Since this data file does not specify the column names at the top of the file, we will need to input those manually.
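To get a feel for read_table() before using it on the temperature data, here is a minimal sketch on a tiny made-up file. The file contents and the column names (id, height_cm, weight_kg) are hypothetical and have nothing to do with the lab dataset. The col_types = "idd" input is optional and simply tells read_table() to expect one integer column followed by two decimal columns.

```r
library(readr)

# Hypothetical three-column file, written to a temporary location
example_lines <- c(
  "1      150.2      61.5",
  "2      162.8      70.1",
  "3      171.0      68.4"
)
example_file <- tempfile(fileext = ".txt")
write_lines(example_lines, example_file)

# The file has no header row, so the column names are supplied by hand
example_df <- read_table(
  example_file,
  col_names = c("id", "height_cm", "weight_kg"),
  col_types = "idd"
)
example_df
```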

Review the about the dataset section and list the column names for ALHUNTSV.txt as inputs to the combine() function. As a reminder, the template for using combine() is,
```
combine("column name 1", "column name 2")
```
Assign this to a variable named col_names. Then, below col_names in the same code block, write:
```
alhuntsv <- data_file %>%
  read_table(col_names = col_names) %>%
  mutate(
    state = ...,
    city = ...
  ) %>%
  mutate(
    date = make_date(
      year = ...,
      month = ...,
      day = ...
    )
  )
```
You will need to replace the ellipses (...) to make this work. The first mutate() is used to label the dataset with the city and state where the temperature data was measured. Fill those inputs in using the variables you assigned earlier in the lab. The second mutate() creates a column with the date data type using the make_date() function from the lubridate package. Fill in the column names (these are what you listed in col_names) that correspond to the year, month, and day of the temperature measurements.
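If make_date() is new to you, the following sketch shows what it does when given plain numbers instead of column names. The dates themselves are arbitrary examples.

```r
library(lubridate)

# A single date built from its three parts
single_date <- make_date(year = 2018, month = 3, day = 14)
single_date

# make_date() also works element by element on vectors, which is
# what happens when it is given columns inside mutate()
several_dates <- make_date(
  year = c(1995, 1995, 1996),
  month = c(1, 2, 12),
  day = c(1, 15, 31)
)
several_dates
```

Because the result has the date data type, R knows how to sort it and place it correctly on a plot axis.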

One data file down, only 324 more to go!

User-defined functions

Before we can continue, we need to take a detour and learn about the concept of user-defined functions, which are the fundamental building blocks for automation. We’ve made use of many different functions during this course that were either pre-loaded by R or were loaded after running library(tidyverse). Any command that you run that has parentheses where you specify inputs, such as ggplot(), filter(), mutate(), or mean(), is a function. We’ve managed to accomplish a lot simply by relying on these pre-loaded functions! Now it’s time for you to create your own.

Functions that you create are known as user-defined functions, and an example of a simple user-defined function is as follows:

add <- function(number1, number2) {
  result <- number1 + number2
  cat(number1, "plus", number2, "equals", result, "\n")
}

This creates a user-defined function called add() that requires two inputs, number1 and number2. It then adds those numbers and prints a sentence that summarizes the computation. Try it out!

Copy and paste the code that defines the add() function into a code block and run the block. Create a second code block and run add(4, 6) and verify that you get some output. Then, right below add(4, 6), type add() again and fill in your own two inputs and execute it.

Although this isn’t the most useful function that you can create, it does illustrate the basic procedure for creating a user-defined function in R. Let’s review what we just did:

We named our function add() in the same way that we’ve saved outputs to variables in previous labs, by using the <- symbol
We indicate that we are creating a user-defined function by using the word function
The function inputs (these are also called arguments) are provided in the parentheses immediately following the word function, i.e. number1 and number2 in function(number1, number2)
The code that the user-defined function will run is put between two curly braces, { and }.
The region between { and } is known as the function body, and you write code within the function body in the same way that you write code in an R Markdown code block
The inputs, in this case number1 and number2, should be used somewhere in the function body
When you run add(4, 6), the function automatically knows to substitute 4 wherever number1 is and 6 wherever number2 is, so the line,
```
result <- number1 + number2
```
becomes
```
result <- 4 + 6
```
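In add(4, 6) the inputs are matched by position: the first value goes to number1 and the second to number2. As a small sketch, inputs can also be matched by name, in which case the order you type them in no longer matters. The add() definition is repeated here so that the block runs on its own.

```r
# Same definition as above, repeated so this block is self-contained
add <- function(number1, number2) {
  result <- number1 + number2
  cat(number1, "plus", number2, "equals", result, "\n")
}

# Matched by position: number1 is 4, number2 is 6
add(4, 6)

# Matched by name: the order is swapped, but number1 is still 4
add(number2 = 6, number1 = 4)
```

Both calls print the same sentence. You have already been matching inputs by name throughout the course, for example with data = and mapping = in ggplot().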

What if we wanted to access the result of add(4, 6), which is 10? Maybe after we run add(4, 6) the variable result becomes defined. In your Console window, type add(4, 6), then type result. I suspect that you’ll see the following:

> add(4, 6)
4 plus 6 equals 10 
> result
Error: object 'result' not found

So it seems like the variable result is “forgotten” after add() is run. Instead, maybe we could try to save the result variable using the <- symbol. In your Console window, type,

add_result <- add(4, 6)

then type add_result. When you do this, I suspect you’ll get a NULL instead of a 10.
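Where does the NULL come from? The last line that runs inside add() is the call to cat(), and cat() prints its text to the screen but hands back NULL. A quick sketch to confirm this:

```r
# cat() prints to the screen, but the value it hands back is NULL
printed_value <- cat("This sentence is printed, not returned\n")

is.null(printed_value)
```

So add_result ended up holding what cat() returned, not the sum.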

As you now see, the add() function, as written, is not allowing us to access the value of 10 when we try to save it. How do we fix this? It turns out that the fix is straightforward: we just need to put the word result on the last line of the function body:

add_fix <- function(number1, number2) {
  result <- number1 + number2
  cat(number1, "plus", number2, "equals", result, "\n")
  result
}

Let’s try it out.

Copy and paste the code that defines the function add_fix() into your R Markdown file. Then, create a second code block and run
```
add_result <- add_fix(4, 6)
```
Check the value inside of add_result to confirm that it is equal to 10.

This demonstrates how we are able to return values from user-defined functions that we can save for later use. The last thing we execute within the function body is what the function returns, and whatever is returned is what we can save for later use with the <- symbol. All other variables used in the function body are forgotten.
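R also has a return() function that states explicitly what a function gives back. The sketch below uses a hypothetical function named multiply(), which is not used elsewhere in this lab, and also checks that the variable created inside the function body is gone afterwards.

```r
# Hypothetical example function, not part of the lab exercises
multiply <- function(number1, number2) {
  product <- number1 * number2
  # return() hands the value back explicitly; writing just `product`
  # on the last line would do the same thing
  return(product)
}

multiply_result <- multiply(3, 5)
multiply_result

# The variable `product` only existed while multiply() was running
exists("product", inherits = FALSE)
```

Both styles are common in R code. Leaving off return() and ending with the value, as in add_fix(), is the style used in the rest of this lab.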

If you still don’t feel like you could make a user-defined function on your own without guidance, that’s okay! It takes time and practice to get used to them. Since this is still a new concept, you’ll continue to be given hints and provided with templates to help you out.

Let’s consider another example based on the mpg dataset. If you were given this dataset for a lab, you might be asked to create a density plot for the hwy, cty, and displ variables. Up until now, you would create three different code blocks to complete this task that might look like this:

```{r}
ggplot(data = mpg) +
  geom_density(
    mapping = aes(x = hwy)
  )
```

```{r}
ggplot(data = mpg) +
  geom_density(
    mapping = aes(x = cty)
  )
```

```{r}
ggplot(data = mpg) +
  geom_density(
    mapping = aes(x = displ)
  )
```

These three blocks of code are nearly identical, with the only difference between them being the input for the aes() function. This seems like a good candidate for a user-defined function! If I asked you to build one yourself based on what we’ve seen so far, you might try this:

# SPOILER ALERT: This function won't work
mpg_density_plot <- function(variable) {
  ggplot(data = mpg) +
    geom_density(
      mapping = aes(x = variable)
    )
}

If you define this function and then run mpg_density_plot(hwy), instead of getting a visualization, you’ll get the disappointing message:

Error in FUN(X[[i]], ...) : object 'hwy' not found

Like so many other cases, tidyverse functions have their own way of doing things compared to regular R, and this includes how you can specify unquoted variables in a user-defined function. Without special handling, what R will do here is check to see if hwy is a variable that stores a value. If it isn’t, then R just prints an error message and stops. If you want to be able to run mpg_density_plot(hwy) without getting an error message, you need to modify the function definition to the following:

# Unlike before, this function will now work
mpg_density_plot <- function(variable) {
  user_input <- rlang::enquo(variable)
  ggplot(data = mpg) +
    geom_density(
      mapping = aes(x = !!user_input)
    )
}

Let’s try this out.

Copy and paste the code that defines the corrected mpg_density_plot function into a code block and run the block. Then, create three more code blocks and use mpg_density_plot to create density plots for the hwy, cty, and displ variables.

The line user_input <- rlang::enquo(variable) converts the word you give as input to mpg_density_plot into a special format that is used in the tidyverse packages. When that word later needs to be used to specify a column name in a dataset, you put two exclamation points in front of it, i.e. aes(x = !!user_input). This would also be what you would have to do if you wanted to use a command from dplyr, such as select(),

mpg_select <- function(column) {
  user_input <- rlang::enquo(column)
  mpg %>%
    select(!!user_input)
}

This allows us to select a single column like so:

mpg_select(hwy)

hwy
29
29
31
30
26
26
…

Again, don’t worry right now if this business with rlang::enquo() and !! doesn’t make total sense to you. This example was meant to show you what is required if you want to specify column names in the inputs to your user-defined function. That way you have a reference to come back to in case you ever need to do this!
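For future reference, newer versions of the tidyverse (rlang 0.4.0 and later) also offer a shorthand for the same idea: wrapping the input in double curly braces, {{ }}, does the job of rlang::enquo() and !! in a single step. The sketch below assumes a reasonably up-to-date installation, uses R's built-in mtcars dataset, and defines a hypothetical function named column_mean() purely for illustration.

```r
library(dplyr)

# Hypothetical helper: computes the mean of whichever column is named
column_mean <- function(data, column) {
  data %>%
    summarise(mean_value = mean({{ column }}))
}

# The column name is given unquoted, just like in other dplyr functions
column_mean(mtcars, mpg)
```

If your installation is older and {{ }} produces an error, the rlang::enquo() and !! approach shown above still works.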

Function to read a data file

Now that we’ve learned a bit about what functions are, let’s take what we did in the extracting information from a file name and read a data file, label the dataset sections and combine this code into a single function.

The structure for our user-defined function read_data_file() is as follows:
```
read_data_file <- function(data_file) {
  # file_state <- Code to get two-letter state/country code from filename
  # file_city <- Code to get city names from filename
  # col_names <- Code to list column names
  # temperature_data_frame <- Code to read and label the data

  temperature_data_frame
}
```
Using the results of Exercises 8, 9, and 10, replace the first four line comments in the function body with working code. When the function is implemented correctly, running read_data_file(path(data_dir, "ALHUNTSV", ext = "txt")) should produce the same output as running alhuntsv in the Console pane.

Read all the data files

Now that we have read_data_file() defined, we’re ready to read and combine the rest of the data files together into one large data frame. To do this, we will need a list of all the data files in the data/ folder. The fs package, as usual, provides us with a convenient function, dir_ls(), that lists all the files in a given folder. To list our files, we simply need to run,

data_files <- data_dir %>%
  dir_ls(glob = "*.txt")

The glob = "*.txt" input tells dir_ls() to only list files ending with .txt, which are our temperature data files.
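To see the effect of the glob input in isolation, here is a small sketch that creates a temporary folder with three empty, made-up files and then lists only the .txt ones. The folder and file names are hypothetical, and everything is created in R's temporary directory, so your data/ folder is not touched.

```r
library(fs)

# Temporary folder with three empty, hypothetical files
demo_dir <- path(tempdir(), "dir_ls_demo")
dir_create(demo_dir)
file_create(path(demo_dir, c("first.txt", "second.txt", "notes.csv")))

# Only the two .txt files are listed; notes.csv is skipped
txt_files <- dir_ls(demo_dir, glob = "*.txt")
path_file(txt_files)
```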

Now we need a way to apply read_data_file() to every file listed in data_files, which will produce a series of data frames containing the temperature data for every location. We then need to take those data frames and bind all the rows together to create a single data frame we can use. Lucky for us, tidyverse loads a very convenient function that can do exactly this, the map_dfr() function,

temperature_df <- data_files %>%
  map_dfr(read_data_file) %>%
  mutate(
    temperature = if_else(
      near(temperature, -99),
      as.numeric(NA),
      temperature
    )
  )

Copy and run the code for creating data_files and then temperature_df. Verify that temperature_df has more than 2.7 million rows.
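If you would like to see what map_dfr() does on a much smaller scale, here is a minimal sketch. The function make_rows() and its inputs are hypothetical: it builds a one-row data frame for each word it is given, and map_dfr() stacks those rows into a single data frame, just as it stacks the data frames returned by read_data_file().

```r
library(purrr)

# Hypothetical function that returns a one-row data frame per input
make_rows <- function(label) {
  data.frame(
    label = label,
    n_chars = nchar(label),
    stringsAsFactors = FALSE
  )
}

labels <- c("alpha", "beta", "gamma")

# Apply make_rows() to each label, then bind the rows together
stacked <- map_dfr(labels, make_rows)
stacked
```

The result has one row per input, which is why temperature_df ends up with one block of rows per data file.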

The mutate() function overwrites the temperature column, using the if_else() function to replace any temperature values equal to -99 with NA. Explain why we want to do this.

Hint

If you are having trouble remembering what a value of -99 means, go back and read the about the dataset section.
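As a side note, dplyr also provides na_if(), a more compact way to write this kind of replacement. The sketch below applies it to a short made-up vector rather than to the lab data. Unlike the if_else() and near() combination above, it looks for values that are exactly equal to -99.

```r
library(dplyr)

# Hypothetical temperature readings, with -99 marking unknown values
example_temperatures <- c(48.8, -99, 31.2, -99)

# na_if() replaces every value equal to -99 with NA
cleaned_temperatures <- na_if(example_temperatures, -99)
cleaned_temperatures
```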

Explore your new temperature database

Now that we have all our data in one place, we should take a look at what’s in it. First, let’s remove the NA values in the temperature column, as those rows represent missing measurements and so we don’t need to keep them for our analysis. We should also remove the year 2018 from the dataset, as the data collection for the year is still in progress.

Use filter() and is.na() to remove the NA values from the temperature column. Then use filter() to remove data for the year 2018. Assign the result to temperature_df_filtered.

Let’s start with a query for the temperature data for Washington, DC.

Filter temperature_df_filtered to only contain the temperature data for Washington, DC. To figure out the filename for Washington, DC, check here: http://academic.udayton.edu/kissock/http/Weather/citylistUS.htm. Assign the filtered dataset to washdc.

Let’s create a visualization showing the average daily temperatures as a function of time for Washington, DC.

Make a scatter plot of your washdc dataset using date for the horizontal axis and temperature for the vertical axis. After you’ve created the plot, explain why you see the daily average temperatures consistently oscillating between higher and lower temperatures as a function of time.

Let’s see if the average and median temperatures per year in Washington, DC have stayed consistent.

Take washdc and group by the year column. Then, within this group, calculate both the mean (average) temperature and the median (middle) temperature. Assign this result to a variable called washdc_year. Then, create two separate scatter plots, one plotting year along the horizontal axis and the mean temperature per year along the vertical axis and the other plotting year along the horizontal axis and the median temperature per year for the vertical axis. Also include geom_smooth() on each of these plots. Describe the trends you see.

Finally, let’s take a more global view of the data and look at the average and median temperatures per year across the entire dataset.

Take temperature_df_filtered and group by the year column. Then, within this group, calculate both the mean (average) temperature and the median (middle) temperature. Assign this result to a variable called world_year. Then, create two separate scatter plots, one plotting year along the horizontal axis and the mean temperature per year along the vertical axis and the other plotting year along the horizontal axis and the median temperature per year for the vertical axis. Also include geom_smooth() on each of these plots. Describe the trends you see.

There’s plenty more you could do with this dataset, but that’s enough for now.

How to submit

To submit your lab, follow the two steps below. Your lab will be graded for credit after you’ve completed both steps!

Save, commit, and push your completed R Markdown file so that everything is synchronized to GitHub. If you do this right, then you will be able to view your completed file on the GitHub website.
Knit your R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to the Lab 10 posting on Blackboard.

Cheatsheets

You are encouraged to review and keep the following cheatsheets handy while working on this lab:

Credits

This lab is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Exercises and instructions written by James Glasbrenner for CDS-102.