This problem set will use the ggplot2 package to generate graphics. “The Grammar of Graphics” is the theoretical basis for the ggplot2 package. Much like how we construct sentences in any language by using a linguistic grammar (nouns, verbs, etc.), the grammar of graphics allows us to specify the components of a statistical graphic.
In short, the grammar tells us that:
A statistical graphic is a mapping of data variables to aesthetic attributes of geometric objects.
A graphic can be broken into three essential components:
data: the data set composed of the variables that we plot
geom: the type of geometric objects visible in a plot (points, lines, bars, etc.)
aes: aesthetic attributes of the geometric object that one perceives on a graphic, for example x/y position, color, shape, and size. Each assigned aesthetic attribute can be mapped to a variable in our data set.
Type complete sentences to answer all questions inside the answer tags provided in the R Markdown document. Round all numeric answers you report inside the answer tags to four decimal places. Use inline R code to report numeric answers inside the answer tags (i.e., do not hard-code your numeric answers).
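As a minimal sketch of what reporting a number with inline R code can look like, assume a small vector x of made-up values (it is hypothetical and not part of this problem set). The value is computed and rounded in a code chunk, then referenced in the sentence, so the reported number updates whenever the data change.

```r
# Hypothetical example values; these are NOT taken from the nc data set
x <- c(6.63, 8.00, 8.44, 4.69)

# Compute the statistic once and round it to four decimal places
mean_x <- round(mean(x), 4)
mean_x

# In the text of the R Markdown document, report the stored value inline:
#   The mean is `r mean_x`.
# or compute it directly inside the sentence:
#   The mean is `r round(mean(x), 4)`.
```

Storing the rounded value in a named object first keeps the sentence short and makes the same number easy to reuse in several answers.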
Once you have opened the document:
author: "John Smith").
R packages are like apps on a cell phone: they are tools for accomplishing common tasks. R is an open-source programming language, meaning that people can contribute packages that make our lives easier, and we can use them for free. For this problem set, the following R packages will be used:
dplyr: for data wrangling
ggplot2: for data visualization
readr: for reading in data
The above packages are already installed on Appalachian’s RStudio Server. Every time you open a new R session, you need to load (open) any packages you want to use. Loading a package is done with the library() function.
library(dplyr)
library(ggplot2)
library(readr)
Remember, “running code” means telling R to “do this”. You tell R to do something by passing code through the console. You can run existing code many ways:
Control-Enter on a PC or Command-Return on a Mac (easiest method 2)
Today, we will practice data visualization using data on births from the state of North Carolina. The code below reads a *.CSV file from a URL into the object nc.
url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vTm2WZwNBoQdZhMgot7urbtu8eG7tzAq-60ZJsQ_nupykCAcW0OXebVpHksPWyR4x8xJTVQ8KAulAFS/pub?gid=202410847&single=true&output=csv"
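# Download the CSV only if a local copy does not already exist
if (!file.exists("./nc.csv")) {
  download.file(url, destfile = "./nc.csv")
}
# Read the local copy into the data frame nc
nc <- read_csv("./nc.csv")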
The data set that displays in your Environment is a large data frame. Each observation or case is the birth of a single child.
The workspace area in the upper right-hand corner of the RStudio window should now list a data set called nc with 800 observations (rows or cases) and 13 variables (columns).
You can see the dimensions of this data frame (number of rows and columns), the names of the variables, the variable types, and the first few observations using the glimpse function.
glimpse(nc)
Rows: 800
Columns: 13
$ fage <dbl> 19, 21, 18, 17, 20, 30, 21, 14, 16, 20, 18, 20, 20, 26,…
$ mage <dbl> 15, 15, 15, 15, 16, 16, 16, 16, 16, 17, 17, 17, 17, 17,…
$ mature <chr> "younger mom", "younger mom", "younger mom", "younger m…
$ weeks <dbl> 37, 41, 37, 35, 37, 45, 38, 40, 24, 40, 37, 40, 39, 38,…
$ premie <chr> "full term", "full term", "full term", "premie", "full …
$ visits <dbl> 11, 6, 12, 5, 13, 9, 15, 12, 5, 8, 10, 17, 9, 11, 10, 1…
$ marital <chr> "married", "married", "married", "married", "married", …
$ gained <dbl> 38, 34, 76, 15, 52, 28, 75, 9, 12, 20, 39, 38, 36, 30, …
$ weight <dbl> 6.63, 8.00, 8.44, 4.69, 6.94, 7.44, 7.56, 5.81, 1.50, 8…
$ lowbirthweight <chr> "not low", "not low", "not low", "low", "not low", "not…
$ gender <chr> "female", "male", "male", "male", "female", "male", "fe…
$ habit <chr> "nonsmoker", "nonsmoker", "nonsmoker", "nonsmoker", "no…
$ whitemom <chr> "white", "white", "not white", "not white", "white", "w…
We can see that there are 800 observations and 13 variables in this data set. It is good practice to see if R is treating variables as factors <fct>; as numbers <int> or <dbl> (basically numbers with decimals); or as characters (i.e. text) <chr>. The variable names are fage, mage, mature, etc. The output from glimpse(nc) tells us that six of the variables are numbers with decimals (<dbl>). The other seven variables are character (<chr>).
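The type check described above can also be done column by column in base R. The sketch below assumes a hypothetical toy data frame that mimics two columns of nc; the values are made up for illustration. Note that base R reports the class "numeric" for columns that glimpse() labels <dbl>.

```r
# Hypothetical toy data frame mimicking two columns of nc (values are made up)
toy <- data.frame(
  visits = c(11, 6, 12),
  habit  = c("nonsmoker", "nonsmoker", "smoker"),
  stringsAsFactors = FALSE
)

# class() reports how R is treating each column
col_classes <- vapply(toy, class, character(1))
col_classes

# A character column can be converted to a factor when it should be
# treated as a categorical variable with a fixed set of levels
toy$habit <- factor(toy$habit)
class(toy$habit)
levels(toy$habit)
```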
habit to be? What variable type is visits? (answer with text)
R is considering habit to be a character. R is considering visits to be a double.
By clicking on the name nc in the Environment pane (upper right window), the data is displayed in the Source pane (upper left window) in the Data Viewer. R has stored these data in a kind of spreadsheet called a data frame. Each row represents a different birth: the first entry or column in each row is simply the row number, and the rest are the different variables that were recorded for each birth. You can close the data viewer by clicking on the x in the appropriate tab.
It is a good idea to try knitting your document from time to time as you go along. Go ahead and make sure your document is knitting, and that your html file includes Exercise headers, text, and code. Note that knitting automatically saves your Rmd file too.
Three types of graphs are explored in this problem set:
Scatterplots allow you to investigate the relationship between two numerical variables. While you may already be familiar with this type of plot, let’s view it through the lens of the Grammar of Graphics. Specifically, we will graphically investigate the relationship between the following two numerical variables in the nc data frame:
weeks: length of a pregnancy on the horizontal “x” axis, and
weight: birth weight of a baby in pounds on the vertical “y” axis
ggplot(data = nc, aes(x = weeks, y = weight)) +
  geom_point()
Let’s view this plot through the grammar of graphics. Within the ggplot() function call, we specified:
The data frame nc by setting data = nc
The aesthetic mapping by setting aes(x = weeks, y = weight)
weeks maps to the x-position aesthetic
weight maps to the y-position aesthetic.
We also add a layer to the ggplot() function call using the + sign. The layer in question specifies the geometric object as points using geom_point().
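To see the layering idea in isolation, here is a minimal sketch that builds a plot in two steps. It assumes a hypothetical toy data frame standing in for nc (the values are made up), and the object names base_plot and scatter are chosen only for illustration.

```r
library(ggplot2)

# Hypothetical toy data standing in for nc (values are made up)
toy <- data.frame(
  weeks  = c(35, 37, 38, 40, 41),
  weight = c(4.7, 6.6, 7.0, 7.6, 8.0)
)

# Step 1: data and aesthetic mapping only; printing this draws empty axes
base_plot <- ggplot(data = toy, mapping = aes(x = weeks, y = weight))

# Step 2: add a geometric object as a layer using +
scatter <- base_plot + geom_point()

# Printing the object draws the scatterplot
scatter
```

Saving the base plot in an object makes it easy to try different geoms on the same data and mapping without retyping them.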
Finally, we can also add axis labels and a title to the plot as shown below. Again we add a new layer, this time a labs or labels layer.
ggplot(data = nc, aes(x = weeks, y = weight)) +
  geom_point() +
  labs(x = "Length of pregnancy (in weeks)",
       y = "Birth weight of baby (lbs)",
       title = "Relationship between pregnancy duration and newborn weight")
weight and weeks? (text only to answer)
The type of relationship between weight and weeks is positive.
weeks again on the x axis and the variable gained on the y axis (the amount of weight a mother gained during pregnancy). Include axis labels with measurement units, and a title. (code only to answer)
# Type your code and comments inside the code chunk
ggplot(data = nc, mapping = aes(x = weeks, y = gained)) +
  geom_point() +
  labs(x = "Length of pregnancy (in weeks)",
       y = "Maternal weight gain (lbs)",
       title = "Relationship between pregnancy duration and maternal weight gain")
Study the code below, and the resulting graphical output. Note that I added a new argument of color = premie inside the aesthetic mapping. The variable premie indicates whether a birth was early (premie) or went full term. Please answer with text:
A. What did adding the argument color = premie accomplish?
B. How many variables are now displayed on this plot?
C. What appears to (roughly) be the pregnancy length cutoff for classifying a newborn as a “premie” versus a “full term”?
ggplot(data = nc, aes(x = weeks, y = gained, color = premie))+
geom_point() +
labs(x = "Pregnancy length (wks)", y = "Maternal weight gain (lbs)")
A. Adding the argument color = premie colors each point by whether the birth was a premie or full term, which distinguishes the two groups.
B. There are three variables displayed on this plot.
C. The rough pregnancy length cutoff for classifying a newborn as a “premie” versus a “full term” appears to be 37 weeks.
mage) and birth weight of newborns on the y axis (weight). Color the points on the plot based on the gender of the resulting baby (variable called gender). Does there appear to be any strong relationship between a mother’s age and the weight of her newborn? (code and text to answer)
# Type your code and comments inside the code chunk
ggplot(nc, aes(x = mage, y = weight, color = gender)) +
  geom_point()
There does not appear to be a strong relationship between a mother’s age and the weight of her newborn.
Make sure your document is knitting, and that your html file includes Exercise headers, text, and code. Note that knitting automatically saves your Rmd file too.
Histograms are useful plots for showing how many elements of a single numerical variable fall in specified bins. This is a very useful way to get a sense of the distribution of your data. Histograms are often one of the first steps in exploring data visually.
For instance, to look at the distribution of pregnancy duration (variable called weeks), consider the following code:
ggplot(data = nc, aes(x = weeks))+
geom_histogram()
A few things to note here:
There is only one variable being mapped in aes(): the single numerical variable weeks. You don’t need to compute the y-aesthetic: R calculates it automatically.
The geometric object is set using geom_histogram().
We can change the binwidth (and thus the number of bins), and the colors as shown next.
ggplot(data = nc, aes(x = weeks))+
geom_histogram(binwidth = 1, color = "white", fill = "steelblue")
Note that none of these arguments went inside the aesthetic mapping argument as they do not specifically represent mappings of variables.
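The difference between setting a fixed value and mapping a variable can be sketched side by side. The example below assumes a hypothetical toy data frame (made-up values) with columns named like those in nc; the object names fixed and mapped are for illustration only.

```r
library(ggplot2)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  weeks  = c(35, 36, 38, 39, 39, 40, 40, 41),
  premie = c("premie", "premie", rep("full term", 6))
)

# Setting: every bar gets the same fixed fill, so fill stays OUTSIDE aes()
fixed <- ggplot(toy, aes(x = weeks)) +
  geom_histogram(binwidth = 1, color = "white", fill = "steelblue")

# Mapping: fill depends on the variable premie, so it goes INSIDE aes()
# and ggplot2 adds a legend for it
mapped <- ggplot(toy, aes(x = weeks, fill = premie)) +
  geom_histogram(binwidth = 1, color = "white")

fixed
mapped
```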
Inspect the histogram of the weeks variable. Answer each of the following with text.
A. The y axis is labeled count. What is specifically being counted in this case? Hint: think about what each case is in this data set.
B. What appears to be roughly the average length of pregnancies in weeks?
C. If we changed the binwidth to 100, how many bins would there be? Roughly how many cases would be in each bin?
A. The count on the y-axis is the number of newborns (births) in each bin.
B. The average length of pregnancies is roughly 38 weeks.
C. If we changed the binwidth to 100, there would be 1 bin, and all 800 cases would be in it.
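The effect of the binwidth on the number of bins can be checked with a little base R. This is a rough sketch under stated assumptions: the vector weeks below holds made-up values (not the nc data), the helper count_bins() is hypothetical, and its bin edges start at a multiple of the binwidth, which is not exactly how geom_histogram() places its bins by default.

```r
# Hypothetical pregnancy lengths in weeks (made-up values, not the nc data)
weeks <- c(24, 35, 37, 37, 38, 39, 40, 40, 41, 45)

# Hypothetical helper: count how many values fall in bins of a given width.
# Bins are left-closed, e.g. [35, 40), and start at a multiple of binwidth.
count_bins <- function(x, binwidth) {
  lower  <- floor(min(x) / binwidth) * binwidth
  upper  <- (floor(max(x) / binwidth) + 1) * binwidth
  breaks <- seq(lower, upper, by = binwidth)
  table(cut(x, breaks = breaks, right = FALSE))
}

# A narrow binwidth gives many bins; a very wide one gives a single bin
count_bins(weeks, binwidth = 5)
count_bins(weeks, binwidth = 100)
```

With a binwidth wider than the whole range of the data, every observation lands in the same bin, which is why such a histogram shows a single bar.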
weight of newborns (which is in lbs), include a descriptive title and axis labels. Make the bins pink with a blue border. (code only to answer)
# Type your code and comments inside the code chunk
ggplot(nc, aes(x = weight)) +
  geom_histogram(binwidth = 1, color = "blue", fill = "pink") +
  labs(x = "Weight of newborns (lbs)",
       y = "Number of births",
       title = "Distribution of newborn birth weights")
Faceting is used to create small multiples of the same plot over a different categorical variable. By default, all of the small multiples will have the same vertical axis.
For example, suppose we are interested in looking at whether pregnancy length varies by the maturity status of a mother (column name mature). This is what is meant by “the distribution of one variable over another variable”: weeks is one variable and mature is the other variable. In order to look at histograms of weeks for younger and more mature mothers, add a plot layer using facet_wrap(~ mature, ncol = 1). The ncol = 1 argument tells R to stack the two histograms into one column.
ggplot(data = nc, aes(x = weeks)) +
geom_histogram(binwidth = 1, color = "white", fill = "steelblue") +
facet_wrap(~ mature, ncol = 1)
# Or
ggplot(data = nc, aes(x = weeks)) +
geom_histogram(binwidth = 1, color = "white", fill = "steelblue") +
facet_wrap(facets = vars(mature), ncol = 1)
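Because the panels share a vertical axis by default, a group with few observations can look flat next to a larger group. As a minimal sketch, assuming a hypothetical toy data frame with made-up values, the scales argument of facet_wrap() can let each panel use its own y axis.

```r
library(ggplot2)

# Hypothetical toy data (values are made up); one group is much smaller
toy <- data.frame(
  weeks  = c(36, 38, 39, 39, 40, 40, 40, 41, 37, 39),
  mature = c(rep("younger mom", 8), rep("mature mom", 2))
)

# Default: both panels share the same vertical axis
shared <- ggplot(toy, aes(x = weeks)) +
  geom_histogram(binwidth = 1, color = "white", fill = "steelblue") +
  facet_wrap(~ mature, ncol = 1)

# scales = "free_y" gives each panel its own vertical axis
free <- ggplot(toy, aes(x = weeks)) +
  geom_histogram(binwidth = 1, color = "white", fill = "steelblue") +
  facet_wrap(~ mature, ncol = 1, scales = "free_y")

shared
free
```

A shared axis makes counts directly comparable across panels, while a free axis makes the shape of each distribution easier to see; which one is appropriate depends on the question being asked.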
weight split by gender of the child. Set the binwidth to 0.5. Which gender appears to have a slightly larger average birth weight? (code and text to answer)
# Type your code and comments inside the code chunk
ggplot(nc, aes(x = weight)) +
  geom_histogram(binwidth = 0.5) +
  facet_wrap(facets = vars(gender), ncol = 1)
Males appear to have a slightly larger average birth weight.
Make sure your document is knitting, and that your html file includes Exercise headers, text, and code. Note that knitting automatically saves your Rmd file too.
While histograms can help to show the distribution of data, boxplots have much more flexibility and can provide even more information in a single graph. The y aesthetic is the numeric variable you want to include in the boxplot, and the x aesthetic is a grouping variable. For instance, below gender is the aesthetic mapping for x, and gained is the aesthetic mapping for y. This creates a boxplot of the weight gained for mothers that had male and female newborns. Note that the fill argument is not necessary, but sets a color for the boxplots.
ggplot(data = nc, aes(x = gender, y = gained)) +
geom_boxplot(fill = "chocolate1") +
theme_bw()
For review, these are the different parts of the boxplot:
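The parts of a boxplot can also be computed numerically. The sketch below assumes a hypothetical vector of weight gains (made-up values, not the nc data) and uses the common convention, also followed by geom_boxplot() by default, that whiskers reach the most extreme observations within 1.5 times the IQR of the box and points beyond that are drawn as outliers.

```r
# Hypothetical weight gains in pounds (made-up values, not the nc data)
gained <- c(9, 12, 15, 20, 28, 30, 34, 36, 38, 38, 39, 52, 75, 76)

# Box: first quartile, median, and third quartile
q   <- quantile(gained, probs = c(0.25, 0.5, 0.75))
iqr <- IQR(gained)

# Fences: 1.5 * IQR below the first quartile and above the third quartile
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
inside        <- gained[gained >= lower_fence & gained <= upper_fence]
whisker_range <- range(inside)

# Observations outside the fences are plotted as individual points
outliers <- gained[gained < lower_fence | gained > upper_fence]

q
whisker_range
outliers
```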
gained by moms, split by the maturity status of the mothers (mature). Include axis labels and a title on your plot. Is the median weight gain during pregnancy larger for younger or older moms? (text and code)
# Type your code and comments inside the code chunk
ggplot(nc, aes(x = mature, y = gained)) +
  geom_boxplot() +
  labs(x = "Maturity status of mother",
       y = "Weight gained by moms (lbs)",
       title = "Maternal weight gain by maturity status")
The median weight gain is larger for younger moms.
weeks by smoking habit. Is the duration of pregnancy more variable for smokers or non-smokers? (i.e. which group has the greater spread for the variable weeks?). (code and text to answer)
# Type your code and comments inside the code chunk
ggplot(nc, aes(x = habit, y = weeks)) +
  geom_boxplot()
The duration of pregnancy is more variable for non-smokers.
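A visual comparison of center and spread between groups can be backed up with numeric summaries. This is a minimal sketch using dplyr, assuming a hypothetical toy data frame with made-up values; the object name spread_by_habit is for illustration only.

```r
library(dplyr)

# Hypothetical toy data (values are made up, not the nc data)
toy <- data.frame(
  habit = c(rep("nonsmoker", 4), rep("smoker", 4)),
  weeks = c(37, 39, 40, 41, 36, 38, 39, 40)
)

# One row per group: the median (center) and the IQR (spread of the box)
spread_by_habit <- toy %>%
  group_by(habit) %>%
  summarize(
    median_weeks = median(weeks),
    iqr_weeks    = IQR(weeks),
    n            = n()
  )

spread_by_habit
```

The IQR corresponds to the height of the box in a boxplot, so a larger value in this table matches a taller box in the plot.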
Make sure your document is knitting, and that your html file includes Exercise headers, text, and code. Note that knitting automatically saves your Rmd file too!
For the following, determine which type of plot to use, make the plot and answer any questions with text. There is a table at the end of this document that can help you determine which plot to use given the question/types of variables.
fage) symmetrical, or does it have a skew?
# Type your code and comments inside the code chunk
ggplot(nc, aes(x = fage)) +
  geom_histogram()
The distribution of father’s age (fage) is skewed to the right.
weight of babies greater for white or non-white mothers (variable called whitemom)?
# Type your code and comments inside the code chunk
ggplot(nc, aes(x = whitemom, y = weight)) +
  geom_boxplot()
The median birth weight of babies is greater for white mothers.
mage) increases, does the duration of pregnancy (weeks) appear to decrease?
# Type your code and comments inside the code chunk
# Mother's age is the explanatory variable, so it goes on the x axis
ggplot(nc, aes(x = mage, y = weeks)) +
  geom_point()
No, the duration of pregnancy does not appear to decrease as the mother’s age increases.
This table is a great resource for thinking about how to visualize data.
Table 3.5 from Modern Dive http://moderndive.netlify.com/index.html