CDS 101: Introduction to Computational and Data Sciences

Instructions

Obtain the GitHub repository you will use to complete the mini-homework, which contains a starter file named tidy_gradebook.Rmd. This mini-homework will help you become more familiar with reshaping datasets using R by guiding you through some examples. Read the About the dataset section to get some background information on the dataset that you’ll be working with. Complete each of the exercises below in the provided spaces within your starter file tidy_gradebook.Rmd. Then, when you’re ready to submit, follow the directions in the How to submit section below.

About the dataset

The dataset we are working with in this mini-homework is the gradebook dataset, a synthetic spreadsheet of student grades for a fictitious course. The gradebook data does not fulfill the tidy data principles, but illustrates a way that an instructor might implement his or her gradebook in spreadsheet software. Load the dataset by running the setup block at the top of your R Markdown file, then inspect the dataset by running View(gradebook) in the Console window.

This dataset breaks all three of the tidy data rules:

Each variable must have its own column.
- The student names, assignment categories, assignment names, and grades are the variables in the dataset. Each individual student should not be considered a separate variable.
Each observation must have its own row.
- A single observation in this dataset should be a single grade given to a student for a specific assignment. Instead of one grade per row, there are multiple grades per row.
Each value must have its own cell.
- Each student name can be split into separate first name and last name columns instead of being kept as a single full name.

The student names were generated using the site: http://listofrandomnames.com. Any similarity to real names is coincidental.

Exercises

Our main goal in the exercises below will be to use the tidyr functions to reshape the dataset so that it fulfills the tidy data principles. Afterwards, we will analyze the dataset.

To fulfill the tidy data principles, we need to use the gather() function to convert the student columns into rows so that each row contains a single grade. A template code block for doing this is provided below,
```
gradebook_long <- gradebook %>%
  gather(
    <students>,
    key = "<key>",
    value = "<value>"
  )
```
To get the code to work, you need to replace <students> with a list of the student names (don’t forget to also remove the angle brackets <> when filling in your answer). Then, <key> and <value> need to be replaced so that the new column containing the student names is called Name and the new column containing the assignment grades is called Grades.
Hints
- Refer to section 12.3.1 of the R for Data Science textbook for additional clues on how to use the gather() function: https://r4ds.had.co.nz/tidy-data.html#gathering
- The names for the student columns contain spaces, so you will need to use backticks when specifying them, for example `Jermaine Gautreau`.
- You shouldn’t need to list all the student names manually. Section 5.4 of R for Data Science contains a hint for how you can specify a range of column names: https://r4ds.had.co.nz/transform.html#select
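As a minimal sketch of the column-range idea from the hints, the example below applies gather() to a small made-up table rather than the gradebook. The `temps` data frame, its city columns, and the City and Temperature column names are all hypothetical and were chosen only for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (not the gradebook): one column per city
temps <- tibble(
  Day = c("Mon", "Tue"),
  `New York` = c(70, 72),
  `Los Angeles` = c(80, 82),
  `Salt Lake City` = c(65, 66)
)

# Backticks are needed because the column names contain spaces.
# The colon selects every column from `New York` through `Salt Lake City`.
temps_long <- temps %>%
  gather(
    `New York`:`Salt Lake City`,
    key = "City",
    value = "Temperature"
  )

temps_long
```

Each row of `temps_long` now holds a single temperature reading for one day and one city, so the three city columns have become six rows.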
The Name column that we created in exercise 1 technically contains two values per cell, i.e. each student’s first name and last name. We can fix this by using the separate() function to split the column into two columns. A template code block for doing this is provided below,
```
tidy_gradebook <- gradebook_long %>%
  separate(
    col = <column to separate>,
    into = combine("<name for first column>", "<name for second column>"),
    sep = "<separator>"
  )
```
To get the code to work, you need to replace <column to separate> with the name of the column that you wish to split into two columns (like before, don’t forget to also remove the angle brackets <> when filling in your answer). Then, <name for first column> and <name for second column> need to be replaced so that the new column of first names is called First Name and the new column of last names is called Last Name. Finally, <separator> needs to be replaced with whatever character or string separates the first and last name of each student.
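To illustrate how the three arguments of separate() fit together, here is a small sketch on made-up data. The `flights` data frame, the Route column, and the hyphen separator are assumptions for this example only and are not part of the gradebook. The sketch uses base R's c() where the templates use combine().

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: each Route cell holds two values joined by a hyphen
flights <- tibble(
  Route = c("Boston-Denver", "Austin-Seattle"),
  Minutes = c(250, 230)
)

# Split Route at the hyphen into two new columns
flights_split <- flights %>%
  separate(
    col = Route,
    into = c("Origin City", "Destination City"),
    sep = "-"
  )

flights_split
```

The original Route column is replaced by the two new columns, and the other columns are left untouched.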

After you’ve obtained your tidy gradebook, take a look at the first several rows using the following code block:
```
tidy_gradebook %>%
  head()
```
The data frame we obtain in exercise 2 fulfills the tidy data principles, but it does not resemble how gradebooks are presented. For example, gradebooks in Blackboard have each row correspond to a student and each column correspond to an assignment. To represent our gradebook in this format, we need to use the spread() function to convert the rows of the Assignment column into their own columns. Let’s do that now for the sake of practice.

The first thing we need to do is drop the Category column, which is typically stored as metadata and not visible in gradebooks. Replace the ellipses … in the code template below so that the Category column is removed and all the other information remains.
```
gradebook_no_categories <- tidy_gradebook %>%
  select(...)
```
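One common way to drop a single column while keeping the rest is to put a minus sign in front of its name inside select(). The sketch below shows this on a made-up `inventory` data frame, whose column names are hypothetical and unrelated to the gradebook.

```r
library(dplyr)

# Hypothetical toy data
inventory <- tibble(
  Item = c("Pen", "Notebook"),
  Aisle = c("A3", "B1"),
  Price = c(1.5, 4)
)

# The minus sign removes Aisle and keeps every other column
inventory_no_aisle <- inventory %>%
  select(-Aisle)

inventory_no_aisle
```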
Now we are ready to use spread() to convert the rows in the Assignment column into multiple columns. A template code block for doing this is provided below,
```
blackboard_gradebook <- gradebook_no_categories %>%
  spread(
    key = <key>,
    value = <value>
  )
```
To get the code to work, you need to replace <key> with the name of the column that contains the names of the columns you will be creating (this means it must be a column containing categorical data). Then, <value> will be replaced with the name of the column that contains the values (the grades) that are to be placed under each of the columns.
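The roles of key and value in spread() can be seen in a short sketch on made-up data. The `scores_long` data frame and its Player, Round, and Points columns are hypothetical and serve only as an illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in long format: one row per player per round
scores_long <- tibble(
  Player = c("Ana", "Ana", "Ben", "Ben"),
  Round = c("Round 1", "Round 2", "Round 1", "Round 2"),
  Points = c(3, 5, 4, 2)
)

# Each distinct value of Round becomes a new column,
# and the cells are filled with the matching Points values
scores_wide <- scores_long %>%
  spread(
    key = Round,
    value = Points
  )

scores_wide
```

The result has one row per player, with a separate column for each round.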

After you’ve obtained your “Blackboard-style” gradebook, take a look at the first several rows using the following code block:
```
blackboard_gradebook %>%
  head()
```
Now that we have a tidy gradebook, let’s look at the grade distribution for one of the homework assignments, Homework 5. First, you will need to filter the dataset so that it only contains entries for Homework 5. A template code block for doing this is provided below,
```
homework5_grades <- tidy_gradebook %>%
  filter(...)
```
Replace the ellipses … so that homework5_grades only contains rows with grades for Homework 5.
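As a reminder of how filter() keeps only the rows that match a condition, here is a minimal sketch on made-up data. The `pets` data frame and its columns are hypothetical and have nothing to do with the gradebook.

```r
library(dplyr)

# Hypothetical toy data
pets <- tibble(
  Species = c("Cat", "Dog", "Cat", "Bird"),
  Weight_kg = c(4.1, 20.3, 3.8, 0.1)
)

# Keep only the rows where Species is exactly "Cat".
# Note the double equals sign for comparison.
cats_only <- pets %>%
  filter(Species == "Cat")

cats_only
```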

After you successfully filter the dataset, create a histogram of the grades stored in homework5_grades. The histogram should have a binwidth of 10 and be centered around 0 (use center = 0). You should be familiar with creating histograms by now, so no template for this one!
Having a tidy dataset makes computing the final grade for each student in the course straightforward. Let’s assume that the category weights are the following:

Category	Weight
Homework	30%
Quiz	20%
Midterm Exam	25%
Final Exam	25%

In order to use this information to compute the final grade for each student, we first need to create a data frame to store the category weights. A template code block for doing this is provided below,
```
grade_weights <- tibble(
  Category = combine("Homework", ...),
  Weight = combine(0.30, ...)
)
```
The first row has been filled in for you. Replace the ellipses … so that the rest of the categories and weights are present in the grade_weights data frame. The spelling and capitalization in the Category column must exactly match the above table.

Once you have grade_weights created, we need to import that information into tidy_gradebook. We can do that using the left_join() function as follows:
```
tidy_gradebook_with_weights <- tidy_gradebook %>%
  left_join(grade_weights, by = combine("Category"))
```
Copy and paste this into your R Markdown notebook and verify that tidy_gradebook_with_weights now contains a column called Weight with the category weights.
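To see what left_join() does with a small lookup table, consider the sketch below. The `orders` and `prices` data frames and their columns are hypothetical stand-ins, not the gradebook or the weights table.

```r
library(dplyr)

# Hypothetical toy data: a main table and a lookup table sharing the Item column
orders <- tibble(
  Item = c("Pen", "Notebook", "Pen"),
  Quantity = c(2, 1, 5)
)

prices <- tibble(
  Item = c("Pen", "Notebook"),
  Price = c(1.5, 4)
)

# Every row of orders is kept, and the matching Price is copied in
orders_with_prices <- orders %>%
  left_join(prices, by = "Item")

orders_with_prices
```

Rows in the main table that share the same Item value all receive the same Price, which is the same way a category weight is repeated for every assignment in that category.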
Now that we’ve imported the category weights into our gradebook, we can continue with computing the final grade for each student. Our next step in this calculation is to multiply the category weights and grades columns together using mutate(). A template code block for doing this is provided below,
```
weighted_grades <- tidy_gradebook_with_weights %>%
  mutate(`Weighted Grade` = ...)
```
Replace the ellipses … so that the values in the Grades and Weight columns are multiplied together.
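The sketch below shows the general pattern of creating a new column from the product of two existing ones with mutate(), including a new column name that needs backticks because it contains spaces. The `trips` data frame and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical toy data
trips <- tibble(
  Hours = c(2, 3.5),
  `Speed (mph)` = c(60, 40)
)

# Multiply two columns row by row and store the result in a new column
trips_with_distance <- trips %>%
  mutate(`Distance (miles)` = Hours * `Speed (mph)`)

trips_with_distance
```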

Now that we’ve weighted the grades, we can use group_by() and summarize() to compute the average weighted grade for each student in each category. A template code block for doing this is provided below,

```
grades_per_category <- weighted_grades %>%
  group_by(`First Name`, `Last Name`, Category) %>%
  summarize(`Category Grade` = ...)
```

Replace the ellipses … so that the average of the weighted grades is computed for each student within each grade category.

Finally, to compute the final grade, use summarize() to sum the average weighted grades per category together. A template code block for doing this is provided below,

```
final_grades <- grades_per_category %>%
  summarize(`Final Grade` = ...)
```

Replace the ellipses … so that the average weighted grades per category are summed together.
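The second summarize() call works without another group_by() because summarize() removes only the last grouping variable, so its result is still grouped by the remaining ones. The sketch below demonstrates this two-step pattern on made-up data. The `sales` data frame and its Store, Department, and Revenue columns are hypothetical.

```r
library(dplyr)

# Hypothetical toy data
sales <- tibble(
  Store = c("North", "North", "North", "South", "South", "South"),
  Department = c("Books", "Books", "Toys", "Books", "Toys", "Toys"),
  Revenue = c(10, 20, 5, 8, 4, 6)
)

# Step 1: average within each Store and Department pair.
# The result is still grouped by Store.
per_department <- sales %>%
  group_by(Store, Department) %>%
  summarize(`Mean Revenue` = mean(Revenue))

# Step 2: sum the per-department means within each Store
per_store <- per_department %>%
  summarize(`Total of Means` = sum(`Mean Revenue`))

per_store
```

For the North store, the department means are 15 and 5, which sum to 20. For the South store, they are 8 and 5, which sum to 13.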

Once you’ve computed the final grades, display the first few rows using the following code:

```
final_grades %>%
  head()
```

Confirm that the first six rows are the following:

First Name	Last Name	Final Grade
Anthony	Capote	87.22143
Bryant	Criddle	90.60000
Cherie	Maiden	71.17143
Christiana	Deblois	83.22857
Cleveland	Fromm	83.49286
Cliff	Vankeuren	79.81429

Now that we have the final grades, let’s use a point plot similar to what is shown in Section 15.4 of the textbook, http://r4ds.had.co.nz/factors.html#modifying-factor-order, and review how the students did:
```
ggplot(final_grades) +
  geom_point(
    mapping = aes(
      x = `Final Grade`,
      y = fct_reorder(`Last Name`, `Final Grade`)
    )
  ) +
  labs(
    x = "Final Grade",
    y = "Student Name"
  ) +
  coord_cartesian(xlim = combine(60, 100))
```
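The fct_reorder() call is what sorts the students along the vertical axis by their final grade instead of alphabetically. As a minimal sketch of its behavior, the example below reorders a made-up vector of town names by a made-up population vector. Both vectors are hypothetical.

```r
library(forcats)

# Hypothetical toy data
towns <- c("Elm", "Oak", "Fir")
population <- c(300, 100, 200)

# Levels are ordered by the matching population values, smallest first
ordered_towns <- fct_reorder(towns, population)

levels(ordered_towns)  # "Oak" "Fir" "Elm"
```

Without fct_reorder(), the levels would be in alphabetical order ("Elm", "Fir", "Oak"), which is also the order ggplot2 would use on the axis.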
The figure’s length and width may look distorted by default. To fix this, add fig.asp = 0.85 as a chunk option in the header of your code block,
```
```{r, fig.asp = 0.85}
```

How to submit

To submit your mini-homework, follow the two steps below. Your homework will be graded for credit after you’ve completed both steps!

1. Save, commit, and push your completed R Markdown file so that everything is synchronized to GitHub. If you do this right, then you will be able to view your completed file on the GitHub website.
2. Knit your R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to the Mini-homework 5 posting on Blackboard.

Cheatsheets

You are encouraged to review and keep the following cheatsheets handy while working on this mini-homework: