DataScienceProject_DataViz

Data Visualization (Ally’s Version)

Author

Ally Kuznia

Data visualization is a powerful tool for understanding your data and for helping your audience (whether reviewers, conference-goers, or other lab members) understand it. This tutorial will walk you through using ggplot to create effective data visualizations.

We’ll be using a publicly available package, taylor, that includes datasets about Taylor Swift’s songs: https://taylor.wjakethompson.com/articles/taylor

Side note for learning data visualization and other data science skills: Tidy Tuesday releases an open-source dataset every Tuesday and keeps them all available on GitHub, so it’s an easy way to find datasets to try new things with! https://github.com/rfordatascience/tidytuesday If you click on their data folder, there are folders going back several years, so a ton is available already.

1. Set up: “…Are You Ready For It?”

Installing and loading in the packages.

# install.packages("taylor")
# install.packages("tidyverse")

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(taylor)

There are several datasets within the taylor package. We will mainly use taylor_all_songs, which stores one row per song along with all of the Spotify information associated with it. We’ll also load the album-level data and the Eras Tour surprise songs.

songs <- taylor_all_songs

albums <- taylor_albums

erastour_surprise <- eras_tour_surprise

This package also includes some Swift themed color palettes!

album_palettes$showgirl
<color_palette[5]>
    #C44615 
    #EB8246 
    #F0CD92 
    #6CAE90 
    #3E5C38 

Let’s clean up the dataset to include only Taylor’s complete albums.

cleaned_songs<- songs %>%
  filter(album_name %in% c("Taylor Swift", "Fearless", "Speak Now", "Red", "1989", "reputation", "Lover", "folklore", "evermore", "Midnights", "THE TORTURED POETS DEPARTMENT", "The Life of a Showgirl", "Fearless (Taylor's Version)", "Speak Now (Taylor's Version)", "Red (Taylor's Version)", "1989 (Taylor's Version)"))

2. ggplot Introduction: “I’ve got a blank space baby, and I’ll write your name”

ggplot starts out like a blank canvas (or a blank space, if you will) that you then add layers to in order to “draw” the data. To create our canvas, we call ggplot() as a function; we can then add layers to it.

ggplot(data = cleaned_songs)

The next step is to tell ggplot how the data should be represented on the plot. The mapping argument within the ggplot() function does this using aesthetics, aes(). Let’s say we wanted to visualize the relationship between danceability and energy in Taylor’s songs. If we want danceability on the x-axis and energy on the y-axis, we add x = danceability, y = energy inside aes().

ggplot(data = cleaned_songs,
       mapping = aes(x = danceability, y = energy)) 

Next we need to add a layer to tell ggplot how we want our data represented (a.k.a. what kind of plot do we want?). Let’s start with a simple scatter plot. In ggplot the function for a scatter plot is geom_point(), and to add a layer we type a + after the ggplot() call.

ggplot(data = cleaned_songs,
       mapping = aes(x = danceability, y = energy)) +
    geom_point()
Warning: Removed 9 rows containing missing values or values outside the scale range
(`geom_point()`).

Notice that it gives you a warning that rows have been removed. This is likely due to NA values in the columns we are trying to plot; ggplot drops those rows either way. You can remove them when cleaning your data, or you can set na.rm = TRUE in the geom, which drops them silently (without the warning). The second option is useful if you don’t want to update your original data frame or create a new one (for example, if other columns have data in those rows that you don’t want to lose).

ggplot(data = cleaned_songs,
       mapping = aes(x = danceability, y = energy)) +
    geom_point(na.rm = TRUE)
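If you would rather handle the NAs during cleaning, one option is tidyr’s drop_na(), which can be limited to just the columns you plan to plot. This is only a sketch: toy_songs is a made-up stand-in for cleaned_songs, and all of its values are invented for illustration.

```r
library(tidyr)

# Hypothetical toy data standing in for cleaned_songs (values are made up)
toy_songs <- data.frame(
  track_name   = c("song_a", "song_b", "song_c", "song_d"),
  danceability = c(0.55, NA, 0.62, 0.70),
  energy       = c(0.40, 0.80, NA, 0.65),
  valence      = c(NA, 0.30, 0.50, 0.90)
)

# Drop rows only where one of the two plotted columns is missing;
# an NA in valence alone does not remove a row
plot_ready <- drop_na(toy_songs, danceability, energy)

nrow(toy_songs)   # 4
nrow(plot_ready)  # 2
```

Because song_a is only missing valence, it survives the filter, which is the main reason to name the columns instead of calling drop_na() on everything.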

Now we have a basic scatter plot. Maybe we wanted to visualize these data by album rather than overall. One way that we can do this is by specifying a color for the points.

ggplot(data = cleaned_songs,
       mapping = aes(x = danceability, y = energy, color = album_name)) +
    geom_point(na.rm = TRUE)

But it’s kind of hard to tell if there are trends within each album because of the density of the data, so we can add a best-fit line for each album. We do this by adding a layer on top of our scatter plot with the function geom_smooth(). We have to tell R what method to use; here we will use “lm” (a linear model). You can also give it a specific formula: here we just assume a linear relationship (y ~ x), but there are other formulas you could use! We will also set na.rm = TRUE so the NAs are dropped without a warning when it calculates the smoothed line.

ggplot(data = cleaned_songs,
       mapping = aes(x = danceability, y = energy, color = album_name)) +
    geom_point(na.rm = TRUE) +
    geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE)

One thing about ggplot is that we can change pretty much anything! In this plot, the confidence bands (the gray shading around each smoothed line) hide a lot of the data, so I want to remove them. To do that, we set se = FALSE within geom_smooth(). Now we can more clearly see the data points and lines.

ggplot(data = cleaned_songs,
       mapping = aes(x = danceability, y = energy, color = album_name)) +
    geom_point(na.rm = TRUE) +
    geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE, se = FALSE) 

We can also change the colors of our plots! Here we will use the package’s color palette for each individual album by adding a layer with scale_color_albums(). You can also set colors manually using scale_color_manual() if you want! There are tons of palettes available, including many color-blind-friendly palettes that can be useful for presenting your data.

ggplot(data = cleaned_songs,
       mapping = aes(x = danceability, y = energy, color = album_name)) +
    geom_point(na.rm = TRUE) +
    geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE, se = FALSE) +
  scale_color_albums()
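For the manual route, here is a minimal sketch of scale_color_manual() with a named vector of colors. The toy data and the album_a/album_b/album_c names are placeholders I made up; the three hex codes come from the Okabe-Ito palette, which is often recommended for color-blind readers.

```r
library(ggplot2)

# Hypothetical toy data; album names and values are placeholders
toy_songs <- data.frame(
  danceability = c(0.50, 0.60, 0.70, 0.55, 0.65, 0.45),
  energy       = c(0.40, 0.50, 0.80, 0.60, 0.70, 0.30),
  album_name   = rep(c("album_a", "album_b", "album_c"), each = 2)
)

# Named vector: the names must match the values in album_name
album_cols <- c(album_a = "#E69F00", album_b = "#56B4E9", album_c = "#009E73")

p_manual <- ggplot(toy_songs,
                   aes(x = danceability, y = energy, color = album_name)) +
  geom_point() +
  scale_color_manual(values = album_cols)

p_manual
```

Naming the vector ties each color to a specific album, so the colors stay put even if you later filter out an album or reorder the data.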

We can also adjust all of the different pieces of the graph using theme(). I can never remember all of the options when I am creating plots so I like to type in ?theme so I can see all of them!

?theme

I often end up making the background transparent to avoid having a big white box around my plots on posters. We do that by setting panel.background in theme() to element_rect(fill = "transparent"). Note that panel.background only covers the plotting area; the area around it (behind the axis labels and title) is controlled separately by plot.background. You can’t tell here, but if you were to export this (see below) and put it into a PowerPoint, the transparent parts of your plot would show the background of your slide!

ggplot(data = cleaned_songs,
       mapping = aes(x = danceability, y = energy, color = album_name)) +
    geom_point(na.rm = TRUE) +
    geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE, se = FALSE) +
  scale_color_albums() +
  theme(
    panel.background = element_rect(fill = "transparent")
  )

3. “We are never, ever, ever, getting back together” (facet wrapping)

While this plot is helpful in that it shows different colors for individual albums, it’s still hard to see any trends across albums. If we want a separate panel for each album, we can use facet wrapping!

ggplot(data = cleaned_songs,
       mapping = aes(x = danceability, y = energy, color = album_name)) +
    geom_point(na.rm = TRUE) +
    geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE, se = FALSE) +
  scale_color_albums() +
  theme(
    panel.background = element_rect(fill = "transparent")
  ) +
  facet_wrap(~album_name)

Now we don’t necessarily need the legend here, and I don’t like the way it looks, so let’s remove it by setting legend.position = “none” in theme().

ggplot(data = cleaned_songs,
       mapping = aes(x = danceability, y = energy, color = album_name)) +
    geom_point(na.rm = TRUE) +
    geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE, se = FALSE) +
  scale_color_albums() +
  theme(
    panel.background = element_rect(fill = "transparent"), 
    legend.position = "none"
  ) +
  facet_wrap(~album_name)
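facet_wrap() also takes a couple of arguments that are handy once you have many panels. The sketch below uses made-up toy data (placeholder album names and values): ncol sets how many columns of panels to use, and scales = "free_y" lets each panel pick its own y-axis range instead of sharing one.

```r
library(ggplot2)

# Hypothetical toy data; album names and values are placeholders
toy_songs <- data.frame(
  danceability = c(0.50, 0.60, 0.70, 0.55, 0.65, 0.45),
  energy       = c(0.40, 0.50, 0.80, 0.60, 0.70, 0.30),
  album_name   = rep(c("album_a", "album_b", "album_c"), each = 2)
)

p_facets <- ggplot(toy_songs,
                   aes(x = danceability, y = energy, color = album_name)) +
  geom_point() +
  # two columns of panels, each with its own y-axis range
  facet_wrap(~album_name, ncol = 2, scales = "free_y") +
  theme(legend.position = "none")

p_facets
```

Free scales make it easier to see the pattern inside each panel, but harder to compare panels against each other, so which one you want depends on the comparison you care about.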

4. Other plot types: “I am what I am ’cause you trained me”

Histograms

There are many types of plots we can create in ggplot. For my data I use histograms a lot! Let’s create a histogram of song valence in Taylor Swift’s songs. You can adjust the bin width using binwidth = in the geom_histogram() function. You can also create other kinds of plots; see the full list here: https://ggplot2.tidyverse.org/reference/index.html

ggplot(data = cleaned_songs) + 
  geom_histogram(aes(x = valence), na.rm = TRUE, binwidth = .01, fill = "pink3", color = "black") +
  theme(
    panel.background = element_blank()
  )

When I make histograms, the axes almost never look the way I want by default, so let’s make some changes. First we will change the x-axis limits and ticks using scale_x_continuous(). I set the limits to start at 0 and end at 1 so I can see the whole distribution. I adjust the ticks by passing a sequence built with seq() to breaks.

ggplot(data = cleaned_songs) + 
  geom_histogram(aes(x = valence), na.rm = TRUE, binwidth = .01, fill = "pink3", color = "black") +
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) +
  theme(
    panel.background = element_blank()
  )

Here the y-axis label defaults to “count”, but it’s actually the number of songs, so we can change the labels using labs(). We can also change the size of the text using element_text() in theme(). Within element_text() you can also change the face (bold, italic, etc.) using face = and the font using family =.

ggplot(data = cleaned_songs) + 
  geom_histogram(aes(x = valence), na.rm = TRUE, binwidth = .01, fill = "pink3", color = "black") +
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) +
  labs(x = "valence", y = "number of songs", title = "Distribution of Valence of Taylor Swift Songs") +
  theme(
    panel.background = element_blank(),
    axis.text.x = element_text(size = 10), 
    axis.text.y = element_text(size = 10),
    axis.title = element_text(size = 12),
    title = element_text(size = 18, family = "Times New Roman", face = "bold")
  )

Boxplots

Boxplots are another common plot for comparing a numeric variable across categories. Here we will subset to just 3 albums and look at the difference in loudness across them.

just_a_couple <- cleaned_songs %>%
  filter(album_name %in% c("Taylor Swift","THE TORTURED POETS DEPARTMENT", "The Life of a Showgirl"))

ggplot(data = just_a_couple)+
  geom_boxplot(aes(x = album_name, y = loudness, fill = album_name),  na.rm = TRUE) +
  scale_fill_albums()+
  labs(x = "album")+
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    panel.background = element_rect(fill = "white")
  )

Adding significance lines

What if we want to show whether the differences between albums are statistically significant? There is a helpful package called ggsignif that can add lines and asterisks (*) for different levels of significance. Note: y_position tells ggsignif where to place the lines on the y-axis, and step_increase spaces out multiple significance lines vertically.

# install.packages('ggsignif')
library('ggsignif')

just_a_couple <- cleaned_songs %>%
  filter(album_name %in% c("Taylor Swift", "THE TORTURED POETS DEPARTMENT", "The Life of a Showgirl"))

ggplot(data = just_a_couple)+
  geom_boxplot(aes(x = album_name, y = loudness, fill = album_name),  na.rm = TRUE) +
  scale_fill_albums()+
  labs(x = "album")+
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    panel.background = element_rect(fill = "white")
  ) +
  geom_signif(
    aes(x = album_name, y = loudness),
    na.rm = TRUE,
    comparisons = list(
      c("Taylor Swift", "The Life of a Showgirl"),
      c("Taylor Swift", "THE TORTURED POETS DEPARTMENT"),
      c("The Life of a Showgirl", "THE TORTURED POETS DEPARTMENT")
    ),
    test = "t.test",
    map_signif_level = TRUE,
    y_position = .45,
    step_increase = .12
  )

5. Saving your plots: “You dug me out of the grave and saved my heart (plot) from the fate of Ophelia”

Let’s say we want to export our plot to use on a poster. We can use ggsave() to save our plot to a folder and set its size to fit our poster. We set the height and width in inches, and dpi is the resolution (dots per inch) of the exported image. By default it saves to your working directory, which you can change using path =.

histo <- ggplot(data = cleaned_songs) + 
  geom_histogram(aes(x = valence), na.rm = TRUE, binwidth = .01, fill = "pink3", color = "black") +
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) +
  labs(x = "valence", y = "number of songs", title = "Distribution of Valence of Taylor Swift Songs") +
  theme(
    panel.background = element_blank(),
    axis.text.x = element_text(size = 12), 
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 16),
    title = element_text(size = 24, family = "Times New Roman", face = "bold")
  )


ggsave("histo.png", 
       plot = histo,
       height = 5, width = 10, units = "in",
       # path = "/Users/allysonkuznia/Desktop/Data Science 2026/plots",
       dpi = 300)
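If you want the exported file itself to have a transparent background (as discussed in the theme section), a sketch like the one below might help. It assumes you want both the panel and the surrounding plot area transparent; the toy data, the file name histo_transparent.png, and the use of tempdir() as the output folder are placeholders I chose for illustration. The bg argument of ggsave() sets the background of the image file.

```r
library(ggplot2)

# Hypothetical toy data standing in for cleaned_songs (values are made up)
toy_songs <- data.frame(valence = c(0.15, 0.22, 0.35, 0.41, 0.48, 0.63, 0.77, 0.90))

p_clear <- ggplot(toy_songs, aes(x = valence)) +
  geom_histogram(binwidth = 0.1, fill = "pink3", color = "black") +
  theme(
    # the plotting area itself
    panel.background = element_rect(fill = "transparent", colour = NA),
    # everything around the plotting area (behind axis labels and title)
    plot.background  = element_rect(fill = "transparent", colour = NA)
  )

# Placeholder output folder; swap in your own with path =
out_dir <- tempdir()

ggsave("histo_transparent.png",
       plot = p_clear,
       path = out_dir,
       height = 5, width = 10, units = "in",
       dpi = 300,
       bg = "transparent")

file.exists(file.path(out_dir, "histo_transparent.png"))
```

At 10 inches wide and 300 dpi, the saved image is 3000 pixels across, which is a useful way to sanity-check whether a plot will look sharp at poster size.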

6. ggplot extensions + more fun: “But now the sky is opalite”

There are several extension packages we can install to do even more with ggplot. I’ll walk through a few examples, but there are many, many more here: https://exts.ggplot2.tidyverse.org/gallery/

ggridges

ggridges lets us view the distribution of an outcome across the levels of another variable. Here I will visualize the distribution of energy of the songs within each of Taylor’s albums!

# install.packages('ggridges')
library(ggridges)
songs_grouped <- cleaned_songs %>%
  group_by(album_name)

ggplot(songs_grouped, aes(x = energy, y = album_name, fill = after_stat(x))) +
  geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
  scale_fill_taylor_c(name = "energy") +
  labs(title = 'taylor swift energy by album', x = 'energy', y = 'album') +
  scale_x_continuous(limits = c(0, .5))
Picking joint bandwidth of 0.0109

Perhaps we want to look at the ridges as a function of another variable. We can bin that variable (or use its groups directly if it is categorical) and map it to fill. So here I bin the loudness variable into 3 groups and plot energy density ridges by album and by loudness bin.

cleaned_songs <- cleaned_songs %>%
  mutate(loudness_bin = cut(loudness, breaks = 3))

ggplot(cleaned_songs, aes(x = energy, y = album_name, fill = loudness_bin)) +
  geom_density_ridges(scale = 3, rel_min_height = 0.01, alpha = 0.7, na.rm = TRUE) +
  labs(fill = "loudness (binned)") +
  scale_fill_taylor_d()
Picking joint bandwidth of 0.00757
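By default cut() labels each bin with its numeric interval, which can make for a clunky legend. Here is a small base R sketch of the same binning step with readable labels. The loudness values are invented, and the labels “quiet”, “medium”, and “loud” are just names I picked.

```r
# Hypothetical loudness values (made up for illustration)
loudness <- c(-14, -13, -9, -8, -4, -3)

# breaks = 3 splits the observed range into 3 equal-width bins;
# labels replaces the default interval labels with readable names
loudness_bin <- cut(loudness, breaks = 3, labels = c("quiet", "medium", "loud"))

table(loudness_bin)
```

Keep in mind that breaks = 3 makes bins of equal width, not equal size, so real data can easily end up with most songs in one bin.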

ggeasy

ggeasy has a ton of commands for doing different things to your plots, like rotating axis labels, resizing titles, etc. theme() does a lot of these things too, but ggeasy might be more intuitive or easier if you are just doing one or two things.

# install.packages('ggeasy')
library(ggeasy)

ggplot(data = cleaned_songs,
       mapping = aes(x = danceability, y = energy, color = album_name)) +
    geom_point(na.rm = TRUE) +
    geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE, se = FALSE) +
  scale_color_albums() +
  theme(
    panel.background = element_blank()
  )+
  facet_wrap(~album_name) +
  #some examples of ggeasy (this package has a lot more!)
  easy_remove_legend() +
  easy_rotate_labels(which = c("x")) +
  easy_all_text_colour("blue4") +
  easy_all_text_size(size = 12) 

patchwork

Patchwork can easily print two plots side by side! Let’s give two of our plots names and then try out patchwork to print them side by side.

#install.packages("patchwork")
library(patchwork)

facets <- ggplot(data = cleaned_songs,
       mapping = aes(x = danceability, y = energy, color = album_name)) +
    geom_point(na.rm = TRUE) +
    geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE, se = FALSE) +
  labs(title = "Danceability by Energy in Taylor Swift's Albums") +
  scale_color_albums() +
  theme(
    panel.background = element_rect(fill = "transparent"), 
    legend.position = "none",
    axis.text.x = element_text(size = 12), 
    axis.text.y = element_text(size = 12),
    title = element_text(size = 10, family = "Times New Roman", face = "bold")
  ) +
  facet_wrap(~album_name)

histo <- ggplot(data = cleaned_songs) + 
  geom_histogram(aes(x = valence), na.rm = TRUE, binwidth = .01, fill = "pink3", color = "black") +
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) +
  labs(x = "valence", y = "number of songs", title = "Distribution of Valence of Taylor Swift Songs") +
  theme(
    panel.background = element_blank(),
    axis.text.x = element_text(size = 12), 
    axis.text.y = element_text(size = 12),
    title = element_text(size = 10, family = "Times New Roman", face = "bold")
  )


histo + facets
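patchwork can do more than place plots side by side. As a rough sketch with made-up toy data: the / operator stacks plots vertically, plot_layout() controls the relative widths, and plot_annotation() adds one overall title. The plot names and the title text here are placeholders.

```r
library(ggplot2)
library(patchwork)

# Hypothetical toy data (values are made up)
toy_songs <- data.frame(
  danceability = c(0.50, 0.60, 0.70, 0.55, 0.65, 0.45),
  energy       = c(0.40, 0.50, 0.80, 0.60, 0.70, 0.30),
  valence      = c(0.20, 0.35, 0.50, 0.55, 0.70, 0.90)
)

p_scatter <- ggplot(toy_songs, aes(x = danceability, y = energy)) +
  geom_point()

p_hist <- ggplot(toy_songs, aes(x = valence)) +
  geom_histogram(binwidth = 0.1)

# "/" stacks one plot on top of the other
stacked <- p_hist / p_scatter

# "+" puts them next to each other; here the second plot gets twice the width
side_by_side <- p_hist + p_scatter +
  plot_layout(widths = c(1, 2)) +
  plot_annotation(title = "Two views of the toy data")

side_by_side
```

The combined object can be passed to ggsave() through its plot argument just like a single plot.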

Plotly

Plotly is cool if you want interactive plots! This might be useful if you are giving a presentation and want to highlight specific points. ggplotly() lets you hover over specific points to see the data associated with each one. Plotly also has its own interface, so if you find yourself wanting interactive plots a lot, it might be worth investing some time in learning to use it beyond the simple cases I’ve shown here.

# install.packages("plotly")
library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
histo <- ggplot(data = cleaned_songs) + 
  geom_histogram(aes(x = valence), na.rm = TRUE, binwidth = .01, fill = "pink3", color = "black") +
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) +
  labs(x = "valence", y = "number of songs", title = "Distribution of Valence of Taylor Swift Songs") +
  theme(
    panel.background = element_blank(),
    axis.text.x = element_text(size = 12), 
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 12),
    title = element_text(size = 16, family = "Times New Roman", face = "bold"))


ggplotly(histo)

facets <- ggplot(data = cleaned_songs,
       mapping = aes(x = danceability, y = energy, color = album_name)) +
    geom_point(na.rm = TRUE) +
    geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE, se = FALSE) +
  scale_color_albums() +
  theme(
    panel.background = element_rect(fill = "transparent"), 
    legend.position = "none"
  ) +
  facet_wrap(~album_name)

ggplotly(facets)
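You can also control what shows up when you hover. One common pattern, sketched below with made-up toy data, is to map a column to a text aesthetic and then list it in the tooltip argument of ggplotly(). The track names and the hover_plot object are placeholders; ggplot itself does not draw anything for text here, it is only picked up by plotly.

```r
library(ggplot2)
library(plotly)

# Hypothetical toy data (track names and values are made up)
toy_songs <- data.frame(
  track_name   = c("song_a", "song_b", "song_c", "song_d"),
  danceability = c(0.50, 0.60, 0.70, 0.55),
  energy       = c(0.40, 0.50, 0.80, 0.60)
)

# text is not drawn by geom_point(); it is carried along for the tooltip
p_hover <- ggplot(toy_songs,
                  aes(x = danceability, y = energy, text = track_name)) +
  geom_point()

# Show the track name first, then the x and y values, on hover
hover_plot <- ggplotly(p_hover, tooltip = c("text", "x", "y"))

hover_plot
```

With the real data you could swap track_name for whichever column your audience is most likely to ask about.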

Helpful resource for general advice with data viz

https://rkabacoff.github.io/datavis/

Mini hacks: “I’ve had the time of my life, fighting dragons (plots) with you”

1. Make a bar graph of the energy of Taylor Swift songs, categorized by key_name.

  • Remove the rows with an NA in key_name.

  • Color the bars to show the data points for each album.

  • Remove the axis ticks, change the color of the background (or remove it - your choice)

  • Resize the title and axis labels.

2. Add significance lines to the plot that test whether there is a significant difference in energy between the key of C and the key of E and a difference between A and G.

3. Make a plot of your choice with the Taylor Swift data. Print the plot side by side with the bar plot you created in the first two mini hacks, then export the combined plot as a .png.

4. Find a new interesting extension and make a plot that would be useful for your own work! (Use any data you like).