CDS 101: Introduction to Computational and Data Sciences

Instructions

Obtain the GitHub repository you will use to complete the mini-homework, which contains a starter file named analyzing_data_distributions.Rmd. This mini-homework will guide you through examples demonstrating how to analyze a dataset using standard statistical functions in R, ggplot2, and dplyr. Read the About the dataset section to get some background information on the dataset that you’ll be working with. Complete each of the exercises below in the provided spaces within your starter file analyzing_data_distributions.Rmd. Then, when you’re ready to submit, follow the directions in the How to submit section below.

About the dataset

The dataset we are using in this mini-homework comes from the article “Exploring Relationships in Body Dimensions” by Heinz, Peterson, Johnson, and Kerk, published in the Journal of Statistics Education, Volume 11, Number 2 (2003). It consists of body girth and skeletal diameter measurements as well as the age, weight, height, and sex for 507 physically active individuals (247 men and 260 women). Load the dataset by running the setup block at the top of your R Markdown file, then inspect the dataset by running View(body_dims) in the Console window.

The table below provides descriptions of the dataset’s 25 variables:

Variable	Description
bia.di	respondent’s biacromial diameter in centimeters
bii.di	respondent’s biiliac diameter (pelvic breadth) in centimeters
bit.di	respondent’s bitrochanteric diameter in centimeters
che.de	respondent’s chest depth in centimeters, measured between spine and sternum at nipple level, mid-expiration
che.di	respondent’s chest diameter in centimeters, measured at nipple level, mid-expiration
elb.di	respondent’s elbow diameter in centimeters, measured as sum of two elbows
wri.di	respondent’s wrist diameter in centimeters, measured as sum of two wrists
kne.di	respondent’s knee diameter in centimeters, measured as sum of two knees
ank.di	respondent’s ankle diameter in centimeters, measured as sum of two ankles
sho.gi	respondent’s shoulder girth in centimeters, measured over deltoid muscles
che.gi	respondent’s chest girth in centimeters, measured at nipple line in males and just above breast tissue in females, mid-expiration
wai.gi	respondent’s waist girth in centimeters, measured at the narrowest part of torso below the rib cage as average of contracted and relaxed position
nav.gi	respondent’s navel (abdominal) girth in centimeters, measured at umbilicus and iliac crest using iliac crest as a landmark
hip.gi	respondent’s hip girth in centimeters, measured at level of bitrochanteric diameter
thi.gi	respondent’s thigh girth in centimeters, measured below gluteal fold as the average of right and left girths
bic.gi	respondent’s bicep girth in centimeters, measured when flexed as the average of right and left girths
for.gi	respondent’s forearm girth in centimeters, measured when extended, palm up as the average of right and left girths
kne.gi	respondent’s knee girth in centimeters, measured as average of right and left girths
cal.gi	respondent’s calf maximum girth in centimeters, measured as average of right and left girths
ank.gi	respondent’s ankle minimum girth in centimeters, measured as average of right and left girths
wri.gi	respondent’s wrist minimum girth in centimeters, measured as average of right and left girths
age	respondent’s age in years
wgt	respondent’s weight in kilograms
hgt	respondent’s height in centimeters
sex	“male” for male respondents and “female” for female respondents

Exercises

1. Visualizations are a helpful tool for exploring a new dataset. As this dataset contains 25 variables, let’s start simple by inspecting the distribution of heights (hgt column) for the men and women (sex column) in the dataset. Create two separate plots, one using geom_histogram() and another using geom_density(), to look at the distribution of heights for men and women.

Hint

The height distributions for men and women should be on the same plot. You will need to use the fill = and alpha = 0.5 inputs. Remember, alpha = 0.5 should not be inside the aes() function! geom_histogram() additionally requires a value for the position = input so that the distributions overlap one another.

2. There are advantages and disadvantages to using either geom_histogram() or geom_density() to represent a data distribution, so it can be preferable to show both types on the same plot. Take the code you wrote in Exercise 1 and try doing this yourself. The distributions should still be split by sex using fill = in both geom_histogram() and geom_density(). Remember, to overlay geoms you add them together as in the code template below (you will need to replace the ellipses …):
```
ggplot(...) +
  geom_histogram(aes(...), ...) +
  geom_density(aes(...), ...)
```
Does the plot look okay, or do you notice a problem?

3. The main issue you should have noticed in Exercise 2 is that the vertical axis scales for the histogram and the density curve do not match. This is because the bar heights in the histogram count the number of data points that fall within a given range of values, while the density curve is scaled so that the area under it equals 1, and so the two are not directly comparable. You can fix this discrepancy by normalizing the histogram: divide the height of each bar by the number of data points in its distribution (247 for the men’s bars, 260 for the women’s bars) and by the bin width, so that the bar areas in each distribution also add up to 1. Luckily for us, ggplot2 provides a convenient way to do this: add the input y = ..density.. inside the aes() function of geom_histogram(). (In ggplot2 3.4.0 and later this notation is deprecated, and y = after_stat(density) is the preferred equivalent.) This rescales the histogram so that it plays the role of a probability mass function (PMF).

Try converting your histogram from Exercise 2 into a PMF. Copy and paste your code and then add y = ..density.. inside the aes() function of geom_histogram().
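
If you’d like to see what this normalization does numerically, here is a minimal sketch using base R’s hist() on simulated values (not the body_dims data; the name toy_values and the numbers are made up for illustration). It assumes density scaling, in which each bar’s count is divided by the number of observations and by the bin width so that the bar areas add up to 1.

```r
# Simulated values for illustration only (not the body_dims data)
set.seed(101)
toy_values <- rnorm(200, mean = 170, sd = 8)

# Bin the values without drawing a plot
h <- hist(toy_values, breaks = 10, plot = FALSE)

# Width of each bin
bin_width <- diff(h$breaks)

# Density-scaled bar heights computed by hand:
# count / (total number of observations * bin width)
manual_density <- h$counts / (sum(h$counts) * bin_width)

# Compare against the density heights that hist() reports
all.equal(manual_density, h$density)

# Total area of the bars after normalization
sum(manual_density * bin_width)
```

Because the total area is the same for every group after this rescaling, groups with different numbers of observations end up on a common vertical scale.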

Tip

Using a PMF or density curve to represent your distribution allows you to directly compare groups of data containing different numbers of observations.

4. Yet another way to evaluate and compare distributions is the cumulative distribution function (CDF), which maps a value to its percentile rank. Since the terms percentile rank and cumulative distribution function may be new to you, pause here and read sections 4.2.2 (http://book.cds101.com/cumulative-distribution-functions.html#percentiles) and 4.2.3 (http://book.cds101.com/cumulative-distribution-functions.html#cdfs) of the Introduction to Computational and Data Sciences supplemental textbook from this module’s reading assignment.
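
As a quick illustration of how a percentile rank relates to the CDF, the sketch below uses base R’s ecdf() on five made-up scores (the scores vector is hypothetical and is not part of the homework dataset). The empirical CDF at a value is the fraction of observations less than or equal to that value, and multiplying by 100 gives the percentile rank.

```r
# Hypothetical scores, for illustration only
scores <- c(55, 66, 77, 88, 99)

# ecdf() returns a function that maps a value to the fraction of
# observations less than or equal to that value
score_cdf <- ecdf(scores)

# Percentile rank of a score of 88 (4 of the 5 scores are <= 88)
rank_from_cdf <- 100 * score_cdf(88)

# The same quantity computed directly from the definition
rank_by_hand <- 100 * mean(scores <= 88)

rank_from_cdf
rank_by_hand
```

Both lines should report a percentile rank of 80 for this toy example.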

Now that you have a better idea of what percentile ranks are and how they connect with the CDF, let’s plot the CDF for men’s and women’s heights. A template code block for doing this is provided below:
```
ggplot(body_dims) +
  geom_step(
    mapping = aes(...),
    stat = "ecdf"
  ) +
  labs(y = "CDF")
```
Replace the ellipses … in aes() to plot a CDF for men’s and women’s heights on the same plot.

Hint

The aes() inputs should be similar to what you used in geom_density() in Exercise 1, except you should use color = instead of fill =.

5. The CDF is a bit like a fingerprint for data distributions, as different classes of distributions have different characteristic shapes. CDFs are also convenient when you want to compare two data distributions within a dataset and determine if they share a similar shape and center or if they deviate from one another. Does the shape of the CDF curve for men’s and women’s heights look the same or different? Put another way, if we shifted the men’s curve to the left, would it mostly overlap with the women’s curve, or would there be deviations?

6. Up until this point in the course, you’ve been asked to discuss the features of data distributions after looking at plots, as it is important that you conceptually understand what the terms center, spread, and shape mean. We are now ready to start quantitatively analyzing our data distributions using statistical measures. The following R functions will be useful for computing basic statistical measures:
- mean(): Computes the average (measure of distribution’s center)
- median(): Computes the median (measure of distribution’s center)
- min(): Finds the minimum value (related to distribution’s spread)
- max(): Finds the maximum value (related to distribution’s spread)
- sd(): Computes the standard deviation (related to distribution’s spread)
- IQR(): Computes the interquartile range, which is the difference between the 75th and 25th percentiles (related to distribution’s spread)
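
To get a feel for what these functions return, here is a small worked example on a made-up vector (toy is a hypothetical name, not part of the dataset). The values were chosen so that each result is easy to check by hand.

```r
# Made-up values, for illustration only
toy <- c(2, 4, 4, 5, 10)

mean(toy)    # (2 + 4 + 4 + 5 + 10) / 5 = 5
median(toy)  # middle value of the sorted data = 4
min(toy)     # 2
max(toy)     # 10
sd(toy)      # sqrt(36 / 4) = 3
IQR(toy)     # 75th percentile (5) minus 25th percentile (4) = 1
```

Notice that the single large value 10 pulls the mean above the median, which is one reason to look at more than one measure of center.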
We can use summarize() to apply these functions to a column in our dataset. Copy the code below into your R Markdown notebook and run it.
```
body_dims %>%
  summarize(
    mean = mean(hgt),
    median = median(hgt),
    min = min(hgt),
    max = max(hgt),
    sd = sd(hgt),
    iqr = IQR(hgt)
  )
```
Of course, these measures are being computed for the entire hgt column, with no distinction between men and women. Copy and paste the code into a second code block and insert a group_by(...) function before summarize() so that the measures are computed separately for men’s and women’s heights. You will need to replace the ellipses … with a variable name.
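
As a reminder of how group_by() changes the behavior of summarize(), here is a minimal sketch on a made-up data frame (toy_scores and its columns group and score are hypothetical and unrelated to body_dims). Once the data frame is grouped, each summary function is applied once per group rather than once for the whole column.

```r
library(dplyr)

# Made-up data frame, for illustration only
toy_scores <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  score = c(1, 2, 3, 10, 20, 30)
)

# One row of summary statistics per value of the grouping column
toy_summary <- toy_scores %>%
  group_by(group) %>%
  summarize(
    mean = mean(score),
    sd = sd(score)
  )

toy_summary
```

The result has one row for group a and one row for group b instead of a single row for the whole data frame.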

7. Let’s now move beyond the hgt column and consider all the measurements in the column range bia.di:wri.gi. As it would be tedious to type out each column name individually, let’s use gather() to turn these columns into rows. Use the name body_part for the column containing the variable names bia.di through wri.gi and use the name measurement for the column containing numerical values. Assign this data frame to a variable called body_dims_long.
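
If the idea of turning columns into rows is hard to picture, the sketch below applies gather() from tidyr to a tiny made-up data frame (toy_wide, its columns, and the names part and length are all hypothetical). Each column in the selected range becomes a set of rows: one column holds the old column name and another holds the value.

```r
library(tidyr)

# Made-up wide data frame, for illustration only
toy_wide <- data.frame(
  id = c(1, 2),
  arm.len = c(60.5, 58.0),
  leg.len = c(90.0, 85.5)
)

# Collapse the columns arm.len through leg.len into name/value pairs;
# the id column is left alone and repeated for each gathered column
toy_long <- gather(toy_wide, key = "part", value = "length", arm.len:leg.len)

toy_long
```

The two measurement columns and two rows of toy_wide become four rows in toy_long. Recent versions of tidyr recommend pivot_longer() for new code, but gather() still works and is what this mini-homework uses.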

8. Visualize the columns we just converted into rows as PMFs and density curves, just like in Exercise 3. Continue to separate the distributions by sex using the fill = input. Use facet_wrap(~ body_part, scales = "free_x") to separate the body part measurements into separate plot windows. Are the distribution shapes mostly similar or mostly different across different body part measurements?

Hint

Adjust the size of the figure by adding the following options to the header of your code block:
```
```{r, fig.asp = 1.1, fig.width = 8}
```

9. Copy and paste the code from Exercise 8 and modify it so that each plot window (facet) shows the CDF for each data distribution. Refer to Exercise 4 for ideas on what you might need to change in the code. Based on the CDF plots, are there any body part measurements where the distributions for men and women are nearly indistinguishable from one another? If so, which ones?

10. Compute summary statistics for the different body part measurements. Group body_dims_long by body_part and sex using group_by(), then pipe the result into a summarize() call like the one in Exercise 6 (you will need to replace hgt with a different column name).

How to submit

To submit your mini-homework, follow the two steps below. Your homework will be graded for credit after you’ve completed both steps!

1. Save, commit, and push your completed R Markdown file so that everything is synchronized to GitHub. If you do this right, you will be able to view your completed file on the GitHub website.
2. Knit your R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to the Mini-homework 6 posting on Blackboard.

Cheatsheets

You are encouraged to review and keep the following cheatsheets handy while working on this mini-homework: