CDS 101: Introduction to Computational and Data Sciences

Instructions

Obtain the GitHub repository you will use to complete the mini-homework, which contains a starter file named pressure.Rmd. This mini-homework will guide you through the process of building and analyzing linear regression models. Read the About the dataset section to get some background information on the dataset that you’ll be working with. Each of the below exercises are to be completed in the provided spaces within your starter file pressure.Rmd. Then, when you’re ready to submit, follow the directions in the How to submit section below.

About the dataset

The dataset¹ we are using in this mini-homework comes from a study conducted by anthropologists to determine the long-term effects of a change in environment on blood pressure. They measured the blood pressure of individuals who had migrated from a pre-modern environment high in the Andes mountains of Peru into the mainstream of Peruvian society at a much lower altitude. The motivation for looking at blood pressure for this group of individuals comes from a prior study in Africa, which suggested that migration from a simpler society to a more technologically complex one might increase blood pressure at first, but that the blood pressure would tend to decrease back to normal over time. The anthropologists recorded systolic blood pressure measurements as it is often a more sensitive indicator compared to diastolic measurements, as well as each participant’s height, weight, and a number of other characteristics. All these data are for males over 21 who were born at a high altitude and whose parents were born at a high altitude.

Using this collected data, we aim to answer the research question,

All else kept equal, does the amount of time spent in an urban environment affect the blood pressure of individuals that recently migrated from a pre-modern society?

You might wonder if the features in the infer package would be sufficient to address this question. The infer package could be used to determine whether or not time spent in an urban environment has any effect at all, such that we could conclude “yes it does” or “no it doesn’t.” What it won’t let us do is predict the blood pressure of another individual that did not participate in the study or help to quantify how much of an effect time spent in an urban environment has relative to the other measurements taken during the study. If we want to predict things based on what we see in the dataset and analyze the relative importance of different variables, we will need to build a model. There are many types of models out there, and for this mini-homework we will focus on building a linear regression model for this dataset.

The dataset blood_pressure contains 9 variables and 39 rows,

Variable	Description
Systol	systolic blood pressure
Age	age
Years	years in urban area
Weight	weight (kg)
Height	height (mm)
Chin	chin skinfold
Forearm	forearm skinfold
Calf	calf skinfold
Pulse	resting pulse rate

Exercises

As always, we begin by creating visualizations to help familiarize ourselves with the dataset. Since we are interested in how the variable Systol is affected by the other variables in the dataset, let’s use facet_wrap() to create a sequence of scatter plots that will show us which variables correlate with Systol. Fill in the ellipses … in the following template to create the visualization,
```
blood_pressure %>%
  gather(..., key = "variable", value = "value") %>%
  ggplot() +
  geom_point(mapping = aes(x = ..., y = ...)) +
  facet_wrap(~ ..., scales = "free_x")
```
The missing input in gather() should grab all the columns in the range Age through Pulse. In the plot, fill in the missing inputs so that the vertical axis is Systol, the horizontal axis is the value of the other variable, and the faceting is done over the different variable names.
Does the Years variable have a positive or negative correlation with Systol? Are there any other variables that follow a similar trend with respect to Systol? Which variables show a moderate to strong positive correlation with Systol?

Reminder

The Years variable is the number of years that an individual has lived in an urban environment.
A common preprocessing step in model building is to combine or transform some of the variables to create additional columns that may improve the accuracy of a model. Our research question asks whether the length of time spent living in an urban environment correlates with decreased blood pressure over time. To figure this out, we could just look at Years, but it may be that any trends in the Years variable are confounded by the overall age of the individual. Account for the effect of age by using mutate() to create a new column by computing the ratio \(\text{urban\_frac\_life}=\dfrac{\text{Years}}{\text{Age}}\). Assign the resulting data frame with these two new columns to blood_pressure_updated.
We’re now ready to create our first linear regression model using the lm() function! Try the following as an example, which builds a linear regression model of Systol using urban_frac_life as the explanatory variable,
```
systol_urban_frac_model <- lm(Systol ~ urban_frac_life, data = blood_pressure_updated)
```
After building the model, use the following two convenience functions from the broom package to get a basic overview of the model. To get a data frame summarizing the model parameters, use the tidy() function,
```
systol_urban_frac_model %>%
  tidy()
```
The values under the estimate column give us the intercept and slope of our linear model (remember \(y=mx+b\)?). Additional information about the model, such as the model’s \(R^{2}\) paramter, can be obtained using the glance() function:
```
systol_urban_frac_model %>%
  glance()
```
The \(R^{2}\) value represents the proportion of variability in the response variable that is explained by the explanatory variable. If \(R^{2}\) is close to 1, that means that the model is doing an excellent job capturing the variability of the response variable. If it is close to 0, then it is doing a poor job capturing the variability of the response variable.

Reminder

Notice that we are using the same formula syntax in lm() to specify the model as we used in the specify() function from the infer package. Remember, the response variable goes on the left side of the tilde ~ and the explanatory variable goes on the right side.
After building a model, we would like to know what it predicts and how accurate the predictions are. One way to capture the accuracy of the predictions is to calculate the residuals, which tell us how far off any given prediction is from the true value. The modelr package, which is part of the tidyverse, provides us with functions for adding model predictions and residuals to our data frame. To do this for our model Systol ~ urban_frac_life, we pipe our dataset into two functions, add_predictions() and add_residuals(), which each take the model systol_urban_frac_model that we built earlier as input,
```
systol_urban_frac_df <- blood_pressure_updated %>%
  add_predictions(systol_urban_frac_model) %>%
  add_residuals(systol_urban_frac_model)
```
The predictions are added under a new column called pred and the residuals are added under a new column called resid.
Now that we’ve made our predictions and calculated the residuals, let’s create some visualizations to investigate how well our model performs. First, let’s plot the model fit on top of the Systol versus urban_frac_life scatter plot. To create this visualization, we will need the slope and intercept we printed out in Exercise 4, which should be filled in the template code below,
```
ggplot(systol_urban_frac_df) +
  geom_point(mapping = aes(x = ..., y = ...)) +
  geom_abline(slope = ..., intercept = ...)
```
As before, you will need to fill in the ellipses … to create the plot.
The one problem with the plot in Exercise 6 is that it can only be made when your model has a single explanatory variable. If you create a model with two or more explanatory variables, you will need a different type of plot to visualize the predictive power of your model. One such plot is the observed versus predicted plot, which is created as follows,
```
ggplot(systol_urban_frac_df) +
  geom_point(aes(pred, Systol)) +
  geom_abline(
    slope = 1,
    intercept = 0,
    color = "red",
    size = 1
  )
```
Alternatively, you can use a residual versus predicted plot,
```
ggplot(systol_urban_frac_df) +
  geom_point(aes(pred, resid)) +
  geom_ref_line(h = 0)
```
Make both of these plots in your R Markdown file.
It is also important to inspect the residuals to see what the prediction error is like. It is typical to visualize the residuals using a histogram to get a sense of their center, shape, and overall spread. Create a histogram of the resid column in systol_urban_frac_df, making sure you choose an appropriate bin width for the distribution. What is the shape and center of the residuals?
How do we know if a linear regression model is reliable? In general, three conditions must be met in order for a linear model built using lm() to be reliable:
1. Linearity: The relationship between the explanatory variable and the response variable must be linear
2. Nearly normal residuals: The residuals should be nearly normal (i.e. follow a bell curve shape)
3. Constant variability: The variability of the points around the model line should be roughly constant
The first and third conditions, linearity and constant variability, can be determined from the plots created in Exercise 7. Look at your observed versus predicted and residual versus predicted plots again. If the points in either plot appear to follow a non-linear (curved) trend, then that’s a tell-tale sign that the condition for linearity has been violated. Similarly, in the residual versus predicted plot, if the residual spread seems to meaningfully increase or decrease as the predicted value changes then the constant variability condition has been violated. Interpret the plots and conclude whether or not the relationship between Systol and urban_frac_life is linear or non-linear and whether constant variability is violated.
The histogram we created in Exercise 8 can give us a basic idea of whether the second condition, nearly normal residuals, is violated, but we should have a more precise method for figuring this out. A useful method for determining if the residuals are nearly normal is to build a Q-Q plot using geom_qq(), which creates a plot that shows us precisely where the distribution of residuals deviates from normality. To create the Q-Q plot of the residuals, run the following code,
```
ggplot(data = systol_urban_frac_df) +
  geom_qq(mapping = aes(sample = resid)) +
  geom_qq_line(mapping = aes(sample = resid))
```
Generally, if we have a nearly normal distribution then most of the residuals will stay close to the reference line on the Q-Q plot. Regions where the residual distribution deviates from the normal distribution’s bell curve shape will show up as points that diverge from the reference line. Based on what you see in the visualization, conclude whether or not the residuals are nearly normal.
Create and fit a second linear model for Systol using the Weight variable and name it systol_weight_model. Does this new model seem to predict Systol better than urban_frac_life? Determine this by comparing the \(R^{2}\) values (obtained using glance()) for the two models. Then, create an observed versus predicted plot, a residual versus predicted plot, and a Q-Q residual plot for this new model. Should we conclude that this new model is reliable? Why or why not?
Now let’s see what happens when we build a multivariate model. Build a model for Systol that uses urban_frac_life and Weight at the same time,
```
systol_combo_model <- lm(Systol ~ urban_frac_life + Weight, data = blood_pressure_updated)
```
Create a residual versus predicted plot and a Q-Q residual plot for this model. Should we conclude that this new model is reliable? Also inspect the \(R^{2}\) value for this new model. Does this seem to perform better than the single-variable models?

How to submit

To submit your mini-homework, follow the two steps below. Your homework will be graded for credit after you’ve completed both steps!

Save, commit, and push your completed R Markdown file so that everything is synchronized to GitHub. If you do this right, then you will be able to view your completed file on the GitHub website.
Knit your R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to Mini-homework 8 posting on Blackboard.

Cheatsheets

You are encouraged to review and keep the following cheatsheets handy while working on this mini-homework:

Peruvian blood pressure dataset and description sources: https://newonlinecourses.science.psu.edu/stat501/node/299/, https://www2.stat.duke.edu/courses/Fall98/sta113/proj/datasets.html.↩