Background

Data visualization is one of the most effective ways to convey meaningful information in an efficient (and hopefully, interesting) manner.

For the purposes of this overview, we will be covering data visualization using the ggplot2 package. ggplot is a very flexible tool that implements a universal grammar of graphics in order to visualize data. Hadley Wickham offers a great overview of the basics of ggplot (from which much of this overview was adapted) here. Additionally, an incredibly helpful data visualization cheatsheet can be found here.

A Formula for Visualizing Data

In principle, visualizing data using ggplot is simple. All you need are the following three things:

  1. Data
  2. A specification of how to map your data to a coordinate system for visualization
  3. A specification of how you want your data to be visualized (i.e., what kind of graph you want)

We’ll go over each of these one by one.

Data

As we covered previously, objects in R can belong to different classes, and ggplot is designed to work best with data frames. As such, make sure that your data is saved as a data frame object (or converted to one using the as.data.frame function).
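As a minimal sketch of that conversion, assuming your data starts out as a matrix (the object and column names here are made up for illustration):

```r
# A hypothetical numeric matrix standing in for data that is not yet a data frame
scores_matrix <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2,
                        dimnames = list(NULL, c("before", "after")))
class(scores_matrix)       # check the class of the original object

# Convert to a data frame so ggplot can work with it
scores_df <- as.data.frame(scores_matrix)
is.data.frame(scores_df)   # TRUE if the conversion worked
```

The column names of the matrix carry over to the data frame, so they can be used directly inside aes().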

Swiss Data

Let’s try reading in a data frame and figure out a few basic ways to visualize different variables.

First, let’s read in the “swiss” dataset. This dataset contains measures of fertility and socio-economic indicators for each of 47 French-speaking provinces of Switzerland around 1888.

# Load the swiss dataset
swiss <- swiss

# Let's take a look at the head of the dataset
head(swiss)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

Aesthetics

Once we have our data, we need to map our data to a coordinate system in a way that will make sense for subsequent plotting. In ggplot, this process of mapping the data occurs when we specify the aesthetics of our plot.

You map aesthetics in ggplot using the aes function. Let’s look at an example:

Let’s say, for instance, that I’m investigating how Fertility (measured by Ig, a common standardized fertility measure) is related to Education in Switzerland in 1888.

# Let's run a simple linear model for fertility and education

swiss_model <- lm(Fertility ~ Education, data = swiss)
summary(swiss_model)
## 
## Call:
## lm(formula = Fertility ~ Education, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.036  -6.711  -1.011   9.526  19.689 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  79.6101     2.1041  37.836  < 2e-16 ***
## Education    -0.8624     0.1448  -5.954 3.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.446 on 45 degrees of freedom
## Multiple R-squared:  0.4406, Adjusted R-squared:  0.4282 
## F-statistic: 35.45 on 1 and 45 DF,  p-value: 3.659e-07
# Because these two variables seem to be related somehow, let's see how they are related.

# We'll start by making a basic scatterplot of the data

library(ggplot2) # load ggplot2 if you haven't already

ggplot(swiss, aes(Education, Fertility)) # here Education is mapped to X, and Fertility is mapped to Y

# Even though we have both variables of interest present, it didn't actually do much. So how can we see the data points on the actual plot? 

Note: Aesthetics are not limited to x and y. If you have a categorical variable that you would like to see represented as different colors on your plot, you can map that variable to an aesthetic like color or fill inside the aes function call, depending on what you need to do. Fixed properties that do not depend on your data (e.g., making every point red) are instead set outside of aes, as arguments to the geom. We’ll show some practical examples of this later.
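To make the difference between mapping a variable and setting a fixed value concrete, here is a small sketch. The Majority column is created purely for illustration, and the 50% cutoff is an arbitrary assumption rather than anything in the original dataset:

```r
library(ggplot2)

# Build a categorical variable for illustration: provinces that are
# majority Catholic vs. not (the 50% cutoff is an arbitrary assumption)
swiss_religion <- swiss
swiss_religion$Majority <- ifelse(swiss_religion$Catholic > 50, "Catholic", "Protestant")

# Mapping: color depends on a variable, so it goes inside aes()
p_mapped <- ggplot(swiss_religion, aes(Education, Fertility, color = Majority)) +
  geom_point()

# Setting: color is one fixed value, so it goes outside aes(), in the geom
p_set <- ggplot(swiss_religion, aes(Education, Fertility)) +
  geom_point(color = "steelblue")
```

Printing p_mapped should give a legend for Majority, while p_set draws every point in the same color with no legend.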

Geoms

Once you’ve mapped your variables, you need to actually give ggplot instructions as to how you want the data visualized. To do this, we need to specify a geometric object, or geom. As you might guess, if you need to create a specific kind of plot, there’s likely a geom that will fit the bill. Here are some common geoms you’re likely to utilize in your data visualization endeavors:

  • geom_point - for scatterplots
  • geom_bar - for bar plots
  • geom_line - for line plots
  • geom_histogram - for histograms
  • geom_smooth - useful for plotting lines of best fit
  • geom_errorbar - for error bars
  • geom_ribbon - useful for plotting standard error around a line

There are many more geoms to become acquainted with, but these will suffice for now. A more complete list of geoms can be found in the data visualization cheat sheet.
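As a quick illustration of one geom from the list, here is a sketch of a histogram of the Fertility variable from the swiss data. The number of bins is an arbitrary choice, not a recommendation:

```r
library(ggplot2)

# A histogram needs only one variable mapped to x;
# ggplot counts the rows that fall into each bin
fertility_hist <- ggplot(swiss, aes(Fertility)) +
  geom_histogram(bins = 10)   # bins = 10 is an arbitrary choice for 47 provinces

fertility_hist   # printing the object draws the plot
```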

Once we know what kind of plot we want to create, we simply need to specify a geom in our call for ggplot:

# Let's go ahead and make a scatterplot of our education and fertility variables
# We'll add a layer using the geom_point() function.

ggplot(swiss, aes(Education, Fertility)) +
  geom_point() # this is what actually represents our mapped data on a plot

# It looks as though there may be a negative correlation here, but I'd like to add in another layer with a regression line.

ggplot(swiss, aes(Education, Fertility)) +
  geom_point() +
  geom_smooth(method=lm)

As you can see, we literally added the geom to our call to ggplot using “+”. ggplot is all about building layers; simply add more layers to your plot to do things like add a title, change the axes, add custom colors, etc. We’ll cover some of these later.
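One way to see the layering idea is to save the plot to an object and add to it step by step. This is only a sketch of the same swiss plot; the object names are made up:

```r
library(ggplot2)

# Save the base plot (data + aesthetics only) to an object
base_plot <- ggplot(swiss, aes(Education, Fertility))

# Add layers one at a time; each "+" returns a new plot object
scatter <- base_plot + geom_point()
scatter_fit <- scatter + geom_smooth(method = "lm")

scatter_fit   # printing the object draws the plot
```

Because each step returns a new object, base_plot and scatter stay unchanged and can be reused with different geoms.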

ggplot Template

Once we have our data, our mappings, and an appropriate geom, a graph can be generated with relative ease using the following template:

ggplot(data = <DATA>, aes(<MAPPINGS>)) + 
  <GEOM_FUNCTION>()

This is technically all that is needed to make a plot using ggplot. However, ggplot is so flexible, one should really take advantage of its functionality to make some cool plots!

Chick Weight

Next, we’re going to determine how different diets can affect the weight of baby chicks over time. We’re going to look at two different ways to visualize the exact same data. First, we’ll visualize the data using a bar plot, and then we’ll visualize the data using a line plot.

# Load in the already available dataset
ChickWeight <- ChickWeight
levels(ChickWeight$Diet) <- c("Diet 1","Diet 2","Diet 3","Diet 4")

If we look at the data, we notice that it’s in long form. There are 50 chicks in total, each weighed at multiple time points from day 0 up to day 21. Additionally, each chick was placed into one of four groups that received different diets. Let’s visualize how these chicks do at the beginning of the diet and the end of the diet.
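If you want to check that description yourself, a few base R calls are enough. This sketch uses only the built-in ChickWeight data:

```r
# Each row is one weighing of one chick (long form)
str(ChickWeight)

# Number of distinct chicks
n_chicks <- nlevels(ChickWeight$Chick)
n_chicks

# First and last measurement days
day_range <- range(ChickWeight$Time)
day_range

# Number of weighings recorded for each diet group
table(ChickWeight$Diet)
```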

Bar plots

In order to create a barplot, we’ll make a dataframe containing only the first and last weight measurements for each chick (day 0 and 21). Additionally, to make a true bar plot with error bars, we’ll have to find a way to summarize the data points. Let’s wrangle some data!

# We'll create a dataframe that summarizes (i.e., mean and standard error) the data by both diet group and time point

library(dplyr) # provides filter(), group_by(), summarise(), and the %>% pipe

cw_summary <-  filter(ChickWeight, (Time == 0) | (Time == 21)) %>%
  group_by(Diet, Time) %>%
  summarise(Mean = mean(weight),se = sd(weight)/(sqrt(n()))) 

# Let's take a peek at the data
head(cw_summary)
## # A tibble: 6 x 4
## # Groups:   Diet [3]
##   Diet    Time  Mean     se
##   <fct>  <dbl> <dbl>  <dbl>
## 1 Diet 1     0  41.4  0.222
## 2 Diet 1    21 178.  14.7  
## 3 Diet 2     0  40.7  0.473
## 4 Diet 2    21 215.  24.7  
## 5 Diet 3     0  40.8  0.327
## 6 Diet 3    21 270.  22.6
# We now have summarized data, which makes it much easier to plot!

# Let's convert the time variable to a factor so ggplot recognizes it as discrete, rather than continuous
cw_summary$Time <- as.factor(cw_summary$Time) 

ggplot(cw_summary, aes(x=Diet, y=Mean, fill=Time)) + # aesthetics
  geom_bar(stat="identity", position=position_dodge()) + # geom for bar plots
  # scale_fill_brewer(palette="Paired") +
  geom_errorbar(aes(ymin=Mean-se, ymax=Mean+se), width=.2,position=position_dodge(.9))  #geom for error bars

# Pro tip: if your x or y data are "offset" from their respective axes, try either:
# scale_(x or y)_(continuous or discrete)(expand=c(0,0)) 
# this all depends on the nature of the scale you're correcting the offset for
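Here is a small sketch of that tip in action. The toy data frame is made up for illustration, and the upper limit of 6 is chosen by hand so the tallest bar does not touch the top of the panel:

```r
library(ggplot2)

# Small made-up data frame, just for illustration
toy <- data.frame(group = c("a", "b", "c"), value = c(3, 5, 2))

# By default ggplot pads the y axis, so the bars "float" above the x axis
floating <- ggplot(toy, aes(group, value)) +
  geom_bar(stat = "identity")

# expand = c(0, 0) removes the padding on a continuous y scale;
# the limits are set by hand to leave some room above the tallest bar
grounded <- floating +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 6))

grounded
```

For a discrete axis the same idea applies with scale_x_discrete(expand = c(0, 0)).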

Facets

Now that we’ve seen how to visualize the data using bar plots, let’s use faceting to visualize the data in another way.

# First we're going to make a background dataset with the diet variable removed. 
ChickWeight_bg <- ChickWeight %>% filter((Time == 0) | (Time == 21)) %>% select(-Diet)

# let's group the data by each chick; this will carry out any subsequent geoms for each chick
ggplot(filter(ChickWeight, (Time == 0) | (Time == 21)), aes(Time, weight, color = Diet, group = Chick)) +
  geom_point() + # scatter plot
  geom_line() + # line plot
  facet_wrap(~Diet) + # specify which variable we want to facet by
  geom_point(data = ChickWeight_bg, color = "grey", alpha = .2) + # alpha specifies the "transparency" 
  geom_line(data = ChickWeight_bg, color = "grey", alpha = .2) 

# With facet_wrap, you can also specify the number of rows/columns. For example, try using facet_wrap(~Diet, ncol = 4)
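As a sketch of how the layout arguments behave, the two plots below differ only in the facet_wrap() call. The built-in ChickWeight data is used directly, so the panel labels may differ from the relabeled version above:

```r
library(ggplot2)

chick_lines <- ggplot(ChickWeight, aes(Time, weight, group = Chick)) +
  geom_line()

# Let facet_wrap() choose the arrangement of the four panels
facets_default <- chick_lines + facet_wrap(~Diet)

# Force all four panels into a single row
facets_one_row <- chick_lines + facet_wrap(~Diet, ncol = 4)

facets_one_row
```

Using nrow instead of ncol works the same way if you would rather fix the number of rows.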

The cool part about visualizing this way (i.e., facets), as opposed to bar plots, is that it allows us to more easily see individual trajectories over time (including using more than two time points). Simply changing the function so that we’re not filtering the data to just days 0 and 21 can give us a more complete picture of the chick weight trajectories. Facets are very useful when visualizing datasets that contain categorical variables.

# Start by making a new background dataframe to incorporate all the datapoints
ChickWeight_bg <- ChickWeight %>% select(-Diet) 


# Simply use the unfiltered ChickWeight dataset to see trajectories for chicks at all timepoints.

ggplot(ChickWeight, aes(Time, weight, color = Diet, group=Chick)) +
  geom_point() +
  geom_line() +
  geom_point(data = ChickWeight_bg, color = "grey", alpha = .2) + 
  geom_line(data = ChickWeight_bg, color = "grey", alpha = .2) +
  facet_wrap(~Diet) +
  labs(x = 'Time (days)', y = 'Chick Weight (g)') +
  theme(legend.position="none") 

Basics You Should Know

Making a Title and Labeling the Axes

# To add a title, simply add ggtitle()

ggplot(cw_summary, aes(x=Diet, y=Mean, fill=Time)) + 
  geom_bar(stat="identity", position=position_dodge()) +
  scale_fill_brewer(palette="Paired") +
  geom_errorbar(aes(ymin=Mean-se, ymax=Mean+se), width=.2,
                 position=position_dodge(.9)) +
  ggtitle("Chick Weight by Group") + # title!
  theme(plot.title = element_text(hjust = 0.5))

# To change your x and y labels, just add the labs() function
ggplot(cw_summary, aes(x=Diet, y=Mean, fill=Time)) + 
  geom_bar(stat="identity", position=position_dodge()) +
  scale_fill_brewer(palette="Paired") +
  geom_errorbar(aes(ymin=Mean-se, ymax=Mean+se), width=.2,
                 position=position_dodge(.9)) +
  ggtitle("Chick Weight by Group") +
  theme(plot.title = element_text(hjust = 0.5)) + # this command here centers your title
  labs(x = 'Diet', y = 'Chick Weight (g)') # X and Y labels!

Changing the Scale of Your Axes

# In order to manually change the scale of your axes, you simply have to call the xlim() and ylim() functions

ggplot(swiss, aes(Education, Fertility)) +
  geom_point() +
  geom_smooth(method=lm) +
  xlim(0,60) +
  ylim(0,125) +
  scale_x_reverse() # You can also flip the X or Y axes; note that this replaces the x scale set by xlim() above, hence the message below
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.

Color vs. fill

# To change the color in simple bar graphs, all you have to do is enter the argument "fill='red'" within the geom_bar() function

ggplot(filter(cw_summary, (Time == 21)), aes(x=Diet, y=Mean)) + 
  geom_bar(stat="identity", position=position_dodge()) + # specify color or fill here!
  geom_errorbar(aes(ymin=Mean-se, ymax=Mean+se), width=.2,
                 position=position_dodge(.9)) 

#color = "red"
#fill="red"

# Note: If you're having trouble setting colors, make sure the scale function you're using matches the aesthetic you set. For example, scale_fill_manual() only works when you specify "fill", and scale_color_manual() only works when you specify "color".
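The sketch below shows both ideas on a made-up data frame: color controls the outline of a bar while fill controls its interior, and a manual scale only takes effect for the aesthetic it is named after. The color choices are arbitrary:

```r
library(ggplot2)

# Small made-up data frame, just for illustration
toy <- data.frame(group = c("a", "b", "c"), value = c(3, 5, 2))

# Fixed values: color is the bar outline, fill is the bar interior
fixed_colors <- ggplot(toy, aes(group, value)) +
  geom_bar(stat = "identity", color = "black", fill = "red")

# Mapped values: fill is mapped inside aes(), so the matching scale
# is scale_fill_manual() (scale_color_manual() would have no effect here)
mapped_colors <- ggplot(toy, aes(group, value, fill = group)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("red", "orange", "gold"))

mapped_colors
```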

Making your plots actually look nice

So let’s face it: the default ggplot theme/colors (while an improvement over something like SPSS) aren’t exactly the prettiest. Here are some quick tips to help make your plots actually look nice.

Themes

Themes in ggplot alter the plot background, axes, and presence of gridlines. You can call a theme using the basic command:

theme_(insert theme here)()

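Adding a theme to a plot as a layer changes the theme for that plot only; to globally change the theme for any subsequent plots, pass it to theme_set() (e.g., theme_set(theme_bw())). Some of ggplot’s themes are: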

theme_gray() # this is the default theme for ggplot
theme_bw()
theme_dark()
theme_classic()
theme_light()
theme_linedraw()
theme_minimal()
theme_void()

Go ahead and change the theme for the following plot by adding it like you would any other layer in ggplot:

ggplot(ChickWeight, aes(Time, weight, color = Diet, group = Chick)) +
  geom_point() +
  geom_line() +
  geom_point(data = ChickWeight_bg, color = "grey", alpha = .2) + 
  geom_line(data = ChickWeight_bg, color = "grey", alpha = .2) +
  facet_wrap(~Diet) +
  labs(x = 'Time (days)', y = 'Chick Weight (g)') +
  theme(legend.position="none") 

# If you changed the theme globally with theme_set(), be sure to change it back to the default!
# theme_set(theme_gray()) # restores the default theme
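To show the difference between changing the theme for one plot and changing it for the whole session, here is a minimal sketch. The choice of theme_bw() and theme_minimal() is arbitrary:

```r
library(ggplot2)

p <- ggplot(swiss, aes(Education, Fertility)) +
  geom_point()

# Per-plot: only this one plot is drawn with theme_bw()
p_bw <- p + theme_bw()

# Global: theme_set() changes the default for all later plots
# and invisibly returns the theme that was active before
old_theme <- theme_set(theme_minimal())
p   # now drawn with theme_minimal()

# Restore whatever theme was active before the change
theme_set(old_theme)
```

Saving the return value of theme_set() means you do not need to remember which theme was in use before.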

Tinkering with text

Sometimes it’s helpful to tinker with the text sizes, face, etc. for a more readable plot. In ggplot, you can control many of these using the theme() function. For example, I can change the size and face of the text for my plot titles via:

theme(plot.title = element_text(size = 20, face = 'bold'))

The best part is that you can save numerous changes to an object, and then add them to ggplot like you would any other layer to make many changes at once. This allows for more consistent, customized plots. Here’s what I like to do:

# I used these settings for posters

text_settings <-
  theme(text = element_text(size = 20)) +
  theme(plot.title = element_text(size =20,face='bold')) +
  theme(axis.title.x = element_text(face='bold')) +
  theme(axis.title.y = element_text(face='bold')) +
  theme(axis.text.x = element_text(size = 20)) +
  theme(axis.text.y = element_text(size = 20)) +
  theme(axis.ticks = element_blank())

Now you can just add “text_settings” to ggplot and all these changes will take effect. Feel free to restore the defaults by changing the theme back to theme_gray(). For a comprehensive list of what you can change via theme(), see the documentation.

Notice the difference between this:

cw_summary$Time <- as.factor(cw_summary$Time) 

ggplot(cw_summary, aes(x=Diet, y=Mean, fill=Time)) + # aesthetics
  geom_bar(stat="identity", position=position_dodge()) + # geom for bar plots
  geom_errorbar(aes(ymin=Mean-se, ymax=Mean+se), width=.2,position=position_dodge(.9))  #geom for error bars

And this:

cw_summary$Time <- as.factor(cw_summary$Time) 

ggplot(cw_summary, aes(x=Diet, y=Mean, fill=Time)) + # aesthetics
  geom_bar(stat="identity", position=position_dodge()) + # geom for bar plots
  geom_errorbar(aes(ymin=Mean-se, ymax=Mean+se), width=.2,position=position_dodge(.9)) +  #geom for error bars
  text_settings

Choosing custom color palettes

Packages like RColorBrewer are awesome for creating custom color palettes. Rose, a graduate of the department, made an awesome tutorial on color palettes that you can find here. Let’s load up the RColorBrewer library and see what colors we have available:

library(RColorBrewer)

# pick a palette!
display.brewer.all()

As you can see, we have quite a few to choose from.

# You can assign colors to a variable to use with ggplot with the following command:

colors <- brewer.pal(9,"Spectral") # "Spectral" is the name of the color family in RColorBrewer

# You can also choose as many or as few colors from a color palette as you like by indexing:

# This indexes the 3rd, 5th, and 7th colors from the "Greens" family
select_colors <- c(brewer.pal(9,"Greens")[3],brewer.pal(9,"Greens")[5],brewer.pal(9,"Greens")[7])

# You can also mix and match colors:

# This takes the 6th color from "Reds" and the 6th color from "Greens"
mixed_colors <- c(brewer.pal(9,"Reds")[6],brewer.pal(9,"Greens")[6])

# If you're really cool, you can instead find out the hex codes for your colors and concatenate them like this:

# Get that sweet vaporwave cred (these were adapted from the Python package "Vapeplot")
jazzcup <- c("#80E0DF", "#31AEA6", "#3E88BC", "#783A9C", "#3A2C82")
crystal_pepsi <- c("#CCFFFC", "#E4E9FF", "#F2DCFF", "#FFCEFF")
sunset <- c("#F58F80", "#D5539C", "#FC2A7F", "#A81B56", "#691344")

# Let's see them in action

cw_summary <-  ChickWeight %>%
  group_by(Diet) %>%
  summarise(Mean = mean(weight),
            se = sd(weight)/(sqrt(n()))) 

ggplot(cw_summary, aes(x=Diet, y=Mean, fill=Diet)) + 
  geom_bar(stat="identity", position=position_dodge()) +
  scale_fill_manual(values = jazzcup) + # this is where you manually specify a custom color palette
  geom_errorbar(aes(ymin=Mean-se, ymax=Mean+se), width=.2, position=position_dodge(.9)) +
  text_settings

# Note: You need AT LEAST as many colors as there are unique levels of the variable you're coloring by for any manual color scale to work!
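A quick way to guard against running out of colors is to compare the palette length with the number of levels before plotting. This is only a sketch; diet_colors is a made-up name, and the palette is repeated here so the snippet runs on its own:

```r
# The custom palette from above, repeated so this snippet is self-contained
jazzcup <- c("#80E0DF", "#31AEA6", "#3E88BC", "#783A9C", "#3A2C82")

# How many colors does the Diet variable need?
n_needed <- nlevels(ChickWeight$Diet)

# Check that the palette is long enough before handing it to a manual scale
length(jazzcup) >= n_needed

# Optionally keep exactly as many colors as needed and name them by level,
# so each diet is tied to a specific color
diet_colors <- jazzcup[seq_len(n_needed)]
names(diet_colors) <- levels(ChickWeight$Diet)
diet_colors
```

A named vector like diet_colors can be passed to scale_fill_manual(values = diet_colors) in place of the unnamed palette.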

Cowplot

Cowplot is a package from the lab of Claus O. Wilke (a man of fine tastes). It’s a simple add-on to ggplot that changes some of the underlying ggplot themes behind the scenes and generally makes everything look nicer. Cowplot is the Crock Pot of ggplot; just set it and forget it. Once you load up Cowplot, it will work its magic and everything will look cleaner, no additional functions required. (This describes Cowplot versions before 1.0; newer versions leave the default theme untouched, so you add theme_cowplot() to a plot or pass it to theme_set() yourself.)

The beauty of Cowplot is how you can combine several plots into one larger plot with labeled subplots. For example:

# at this point, install cowplot if you don't already have it

library("cowplot")
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
## 
##     ggsave
# data frame for plot A
cw_summary <-  ChickWeight %>%
  group_by(Diet) %>%
  summarise(Mean = mean(weight),
            se = sd(weight)/(sqrt(n())))

# data frame for plot B
cw_summary2 <-  ChickWeight %>%
  group_by(Diet,Time) %>%
  summarise(Mean = mean(weight),
            se = sd(weight)/(sqrt(n())))

# plotA saved to plot object
plotA <- ggplot(cw_summary, aes(x = Diet, y = Mean, fill = Diet)) + 
  geom_bar(stat = "identity", position = position_dodge()) +
  scale_fill_manual(values = jazzcup) + # this is where you manually specify a custom color palette
  geom_errorbar(aes(ymin = Mean - se, ymax = Mean + se), width = .2, position = position_dodge(.9)) +
  theme(legend.position = "none") # remove legend

# plotB saved to plot object
plotB <- ggplot(cw_summary2, aes(x = Time, y = Mean, color = Diet)) + 
  geom_point(size = 2.5) +
  scale_color_manual(values = jazzcup)   # this is where you manually specify a custom color palette

# combined plot
combined_plot <- plot_grid(plotA, plotB, labels = c("A", "B"))

# We can take it a step further by aligning both plots horizontally with the align argument (use "v" for vertical)
# We can also specify the number of rows or columns
combined_plot <- plot_grid(plotA, plotB, labels = c("A", "B"), align = "h", ncol = 2)

# view plot
combined_plot

# theme_set(theme_gray()) # restores the default theme for any later plots, just in case you want to plot like a commoner

Combined with some custom text themes, packages like Cowplot can result in some pretty slick plots. Thanks Claus! You can learn more about Cowplot here.
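If you want to write a combined figure to disk, one option is ggplot2's ggsave(). The sketch below builds a small two-panel figure from the swiss data and saves it to a temporary file; the file name, dimensions, and resolution are arbitrary assumptions, so swap in your own:

```r
library(ggplot2)
library(cowplot)

# Two simple plots to combine (stand-ins for plotA and plotB)
p1 <- ggplot(swiss, aes(Education, Fertility)) + geom_point()
p2 <- ggplot(swiss, aes(Agriculture, Fertility)) + geom_point()
two_panel <- plot_grid(p1, p2, labels = c("A", "B"), ncol = 2)

# Write to a temporary file here; replace with your own path
out_file <- file.path(tempdir(), "two_panel.png")

# A wide canvas suits two side-by-side panels
ggplot2::ggsave(out_file, plot = two_panel, width = 10, height = 5, dpi = 300)
```

Calling ggplot2::ggsave() with the package prefix sidesteps the masking message shown above when cowplot is loaded.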

Minihacks

Minihack 1: Creating ERPs from raw EEG data

Event-related potentials (ERPs) are the result of averaging many trials of raw EEG, where each trial represents raw EEG time-locked to some event. Essentially, ERPs represent the relatively consistent signal amongst all the noise. Here’s some data from subjects who performed a gambling task. The GainLoss variable represents when their gamble resulted in a loss (GainLoss == 0) or a gain (GainLoss == 1). For this minihack, you will need to:

  • Load the cowplot library
  • Read in the sample EEG
  • Wrangle the data so you can plot the averaged data over each timepoint for both GainLoss conditions
  • Include the standard error around the resulting ERPs
  • Flip the Y axis (negative is up, old-school style)
  • Pick some new colors (any colors) for the GainLoss condition and implement these in your aes call
  • Annotate (using a rect) the time period where you see the condition-related effect
  • Save the plot to a plot object and view it manually
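If the plotting pieces are unfamiliar, here is a hedged sketch on simulated data showing a ribbon for the standard error, a reversed Y axis, and a rectangle annotation. Everything about the data is made up (the column names time_ms, mean_uv, and se_uv, the waveform, and the 200 to 350 ms window), so the real EEG file will need its own wrangling first:

```r
library(ggplot2)

# Simulated, already-averaged data; the real file's columns may differ
set.seed(1)
erp <- expand.grid(time_ms = seq(-200, 800, by = 10), GainLoss = c(0, 1))
erp$mean_uv <- sin(erp$time_ms / 150) * (1 + erp$GainLoss) +
  rnorm(nrow(erp), sd = 0.1)
erp$se_uv <- 0.3   # constant standard error, purely for illustration
erp$GainLoss <- factor(erp$GainLoss, labels = c("Loss", "Gain"))

erp_plot <- ggplot(erp, aes(time_ms, mean_uv, color = GainLoss, fill = GainLoss)) +
  # shaded rectangle marking a hypothetical window of interest
  annotate("rect", xmin = 200, xmax = 350, ymin = -Inf, ymax = Inf, alpha = .2) +
  # ribbon showing the mean plus/minus one standard error
  geom_ribbon(aes(ymin = mean_uv - se_uv, ymax = mean_uv + se_uv),
              alpha = .3, color = NA) +
  geom_line() +
  scale_y_reverse()   # negative voltages plotted upward

erp_plot
```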

Minihack 2: Creating bar plots from ERPs

Recreate the plot from minihack 1, but this time as a bar plot of the average ERP voltage within the same time window you used to highlight the condition-related effect. This will require starting from the raw data and wrangling a little differently. Be sure to:

  • Include standard error bars
  • Use the same colors you used in the first minihack
  • Save the plot to a plot object and view it manually

Minihack 3: Using Cowplot

  1. Use Cowplot to:
  • Stitch together your previously saved plots
  • Arrange them into one row with two columns
  • Align them horizontally
  • Label the first plot “A”, and the second plot “B”
  • Save the plot to a plot object and view it manually

Minihack 4: Get Creative

For this minihack, you will have the freedom to make your own graph from one of two readily available datasets. To get these datasets, install and load the package “reshape2”. Once loaded, assign the pre-loaded dataframes “french_fries” and “tips” to their own variables in your global environment. “french_fries” is a dataframe consisting of data collected from a sensory experiment conducted at Iowa State University in 2004. The investigators were interested in the effect that using three different fryer oils had on the taste of the fries. “tips” is a dataset where one waiter recorded information about each tip he received over a period of a few months working in one restaurant. If you want more information regarding these datasets and their variables, type ‘?french_fries’ or ‘?tips’. Simply make a plot that reveals something interesting in the data. Make sure to incorporate facets into your plot, make sure that it is aesthetically pleasing, and make sure the graph matches the data you’re trying to depict. Be creative and have fun!
