Lab 02 – Exploration by visualization: the Galton dataset
Instructions
Obtain the GitHub repository you will use to complete the lab, which contains a starter file named lab02.Rmd. This lab shows you how to use the ggplot2 package to visualize datasets and demonstrates how visualization plays a crucial role in data exploration. Start by reading Why data visualization? to learn why data visualization is important and why the first lab focuses on it. Next, read the About the dataset section to get some background information on the dataset that you’ll be working with. Continue reading the lab instructions and complete any exercises using the provided spaces within your starter file lab02.Rmd. Then, when you’re ready to submit, follow the directions in the How to submit section below.
Why data visualization?
Why is data visualization an important topic? On the face of it, you might wonder why we need to dedicate any time to this topic. Aren’t plots really easy now that we all have computers? Isn’t making plots and figures one of the last things that we do for a project or lab report, after we’ve figured everything out? Why start with this? Since a picture (or visualization) is worth a thousand words, take a moment to explore the data visualizations linked below.
- After taking a few minutes to explore these visualizations, write at least one full sentence that explains one thing you noticed about either visualization that makes it effective at conveying information.
Visualizations have an important role to play in nearly every stage of a data science project. High-quality visualizations help people to understand your results and can activate their curiosity about your work and ideas. Creating visualizations in R is also easy and fun, and learning how to make them will help you become more comfortable with using R and RStudio. You will quickly see how simple it is to make colorful and eye-catching plots!
About the dataset
You will be exploring the famous dataset by Francis Galton for this week’s lab on data visualization. The data were transcribed by J.A. Hanley [1] who has published them at http://www.medicine.mcgill.ca/epidemiology/hanley/galton/. Francis Galton was developing ways to quantify the heritability of traits in the 1880s, and as part of this work he collected data on the heights of 898 adult children and their parents.
The table below provides descriptions of the dataset’s 6 variables,
Variable | Description |
---|---|
family | a category with levels for each family |
father | the father’s height (in inches) |
mother | the mother’s height (in inches) |
sex | the child’s sex: F or M |
height | the child’s height as an adult (in inches) |
nkids | the number of adult children in the family, or, at least, the number whose heights Galton recorded. |
Visualization by example
How do I load the dataset?
The dataset should be automatically loaded into the variable galton for you in the R Markdown file for your lab report. To explore the dataset, type the following in your Console window,
View(galton)
If running this does not show you a data table, make sure you run the setup code block at the top of your R Markdown file first!
Before we discuss the general format for creating ggplot2 plots, let’s play around with some examples:
In your lab report, create an R code block that contains the following code:
ggplot(data = galton) + geom_histogram( mapping = aes(x = height), bins = 30 )
To run the code, either click the green “play” button in the upper right corner of the R code block or, while your cursor is inside the code block, press Ctrl+Shift+Enter. This should create a plot called a histogram.
After creating the histogram, look at the height column in the data table you can view with
View(galton)
and compare it with the histogram. Then, describe what the histogram is doing with the data in this column.
The input parameter bins = 30 controls an important visual element in the histogram plot. Let’s experiment with the parameter in order to figure out what it does.
- Using the code you wrote in Exercise 2 as a starting point, try setting the input keyword bins equal to something larger than the number 30, and then equal to something smaller than the number 30. This will create two plots. Then, change the input keyword from bins to binwidth and set its value equal to 1. Compare the plots and write a conclusion about what the bins and binwidth inputs control.
It is simple to add additional arguments to the aesthetic input aes()
that change the way data are shown, which can reveal trends that were previously hidden from view. Let’s see what the fill argument does to our histogram:
Write the following code in your lab report:
ggplot(data = galton) + geom_histogram( mapping=aes(x = height, fill = sex), binwidth = 1, alpha = 0.5, position="identity" )
Run the block and look at the output. What did adding fill = sex as a new input in
aes()
do? Does this change the way you might interpret the visualization? What kinds of differences stand out now that we added this?
As you can now see, changing one of the inputs in your ggplot2 code can have a substantial effect on the way your visualization looks. When a visualization reveals new information, we should describe and interpret it in our lab reports.
- Describe the shape of the male and female height distributions and estimate the value each distribution seems to be centered around. Does there appear to be a tangible difference in the average height for these two distributions? Based on what you know about the relative heights of people, is this a result that you would have expected to see? Why or why not?
Based on what you’ve seen so far, do you understand what all of the inputs are doing? The exercise below guides you through the process of exploring how the different inputs affect the plot’s look:
A good way to figure out how R works is to experiment with inputs. What happens if you change the value of alpha = 0.5 (keep it between 0 and 1). What happens if you remove the input position = “identity”? What happens if you replace it with position = “dodge”? What does it change in your output?
Hint: When moving the input, be careful with the commas! A comma should separate each input.
The ggplot2 syntax
Let’s take a small break from making plots and review the general syntax for creating a ggplot2 figure. The command ggplot()
, as you might have figured out already, creates the plot window. Commands with the prefix geom_—such as geom_histogram()
—convert data points into different kinds of visualizations, and the command aes()
controls the aesthetic properties of each geom_ object. We then specify input parameters inside the parentheses of these commands, with each input separated from neighboring inputs by commas , and input names and values separated by the equals sign =. In general, visualizations in ggplot2 have a predictable structure for the commands you need to enter. The general pattern is:
ggplot(data = <data>) + # Required in all ggplot2 visualizations
geom_<function>(
mapping = aes(<mapping>), # Required in all ggplot2 visualizations
stat = <stat>, # Optional, sensible default chosen
position = <position> # Optional, sensible default chosen
) +
coord_<function> + # Optional, sensible default chosen
facet_<function> + # Optional, sensible default chosen
scale_<function> + # Optional, sensible default chosen
theme_<function> # Optional, sensible default chosen
In the above, anywhere you see a word <surrounded> in angular brackets, you can replace it with one of several choices. Bare minimum, you must always specify the first two functions, all the rest are optional and have sensible defaults chosen for you. Also note that each major category—geom_ objects, coord_ objects, facet_ objects, scale_ objects, theme_ objects—is added to the ggplot2 “sentence” using the plus sign +. A helpful way to think about this is that you are building a figure using a series of layers: first you lay down the canvas (ggplot), then you visualize the data on a second layer using a certain style of plot (geom_<function>), and afterward you tweak and polish the plot using additional layers to find-tune things. This layered approach allows you to create nice figures without much effort or the need to memorize dozens of commands.
Scatter-plots
Let’s further explore the data using another type of visualization, the scatter-plot.
Use the following code to create a scatter-plot of each person’s height as a function of their father’s height.
ggplot(data = galton) + geom_point( mapping = aes(x = father, y = height) )
Here, height is the response (dependent) variable and father would be the explanatory (independent) variable. Describe any trends that you see using full sentences.
Next, let’s try and create a plot that is similar in spirit to what we did in Exercise 4 so that we can see how the height variable depends on the father variable when the sex variable is taken into account. One important difference to know is that we need to use the word color as an input instead of fill. Otherwise, the procedure for grouping over the sex variable is basically the same.
- Figure out how to group the data over the sex variable using the color input and create a new plot. What does this plot tell you about the relationship between the height measurements and the sex variable?
Faceting
Let’s introduce another new concept, the facet. Facets create visualizations with multiple panels, splitting things up across a categorical variable. Let’s apply this to our scatter-plot that we created in Exercise 7.
Copy your code from Exercise 7 that created a scatter-plot. Add the following new command to your code snippet using the + sign:
facet_grid(. ~ sex)
Describe what you get as output. Then, create a new code block where you reverse the input for facet_grid like so:
facet_grid(sex ~ .)
What did adding facet_grid do to your output, and what does the order of . ~ sex versus sex ~ . seem to do? Is the information presented here any different from the information in Exercise 8?
Modeling in ggplot2
You can actually create regression models using geom_smooth, which is another handy way to look for trends. You can choose from one of several kinds of regression methods. Here, we’ll use linear regression for creating our models (perhaps you remember drawing one of these “lines of best fit” by hand in a prior science or math class).
Copy your code from Exercise 7 that created a scatter-plot. Add the following new command to your code snippet using the + sign:
geom_smooth( mapping = aes(x = father, y = height), method = "lm" )
Describe the line that you get as output. Does it follow the trends (if any) you’ve previously described in the data? What do you think the semi-transparent gray region around the line represents?
When creating a graph, you often want to make some touch ups after you’ve figured out what to plot. Do the following to add some extra polish to the plot you made in Exercise 10.
Copy your code from Exercise 10 and add the following command to your code:
labs( title = "Height of children as adults versus the height of their fathers", x = "Father's height (in)", y = "Adult child's height (in)" )
You should also alter the size of the scatterplot circles by adding size = 2 as an input in geom_point.
Additional exercises
Create a similar plot to Exercise 7 and make a scatter-plot of each person’s height as a function of their mother’s height. Describe any trends that you see using full sentences. If there’s a trend both here and in Exercise 7, does it follow the same general direction or does it not? If the trends move in the same direction, then does one trend look stronger than the other? If so, which one? Explain how you know this using full sentences.
Start with your code from Exercise 9 and add a geom_smooth command that separately performs a linear regression on the male and female categories. Each subplot (or facet) should have a line in it. Additionally, polish the graph using what you learned in Exercise 11. Interpret your results and explain how well these visualizations explain the trends in the dataset, if any.
Compare the geom_smooth trend-lines for when used the father’s height only and when you used both the father’s height and whether the child was male or female. Do they both show the same trend? Is one trend more “powerful” than the other? Why or why not? Remember to respond by writing full sentences.
How to submit
To submit your lab, follow the two steps below. Your lab will be graded for credit after you’ve completed both steps!
Save, commit, and push your completed R Markdown file so that everything is synchronized to GitHub. If you do this right, then you will be able to view your completed file on the GitHub website.
Knit your R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to Lab 2 posting on Blackboard.
Cheatsheets
You are encouraged to review and keep the following cheatsheets handy while working on this lab:
Credits
This lab is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The description of the Galton dataset was adapted from the documentation of the same dataset that’s available in the mosaicData R package and some of the lab exercises were adapted from problem sets found in Modern Data Science with R by Benjamin Baumer, Daniel Kaplan, and Nicholas Horton. All other exercises and instructions written by James Glasbrenner for CDS-102.
References
[1] J. A. Hanley, “’Transmuting’ Women into Men: Galton’s Family Data on Human Stature,” The American Statistician 58, 237 (2004).