── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
Load Data (without the headers!)
Sometimes you have a dataset that you didn’t put together. And sometimes that dataset is not tidy! Things like headers or extra spacing can be really annoying to deal with. Luckily, there are easy ways to handle it as you load the data into R!
# A tibble: 6 × 15
name character_height character_mass hair_color skin_color eye_color
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 Luke Skywalker 172 77 blond fair blue
2 C-3PO 167 75 <NA> gold yellow
3 R2-D2 96 32 <NA> white, bl… red
4 Darth Vader 202 136 none white yellow
5 Leia Organa 150 49 brown light brown
6 Owen Lars 178 120 brown, gr… light blue
# ℹ 9 more variables: birth_year <dbl>, sex <chr>, gender <chr>,
# homeworld <chr>, species <chr>, `1990` <dbl>, `1991` <dbl>, `1993` <dbl>,
# character_weapon <chr>
Renaming
For the examples below, I used the janitor package and dplyr. There are many different ways to rename in R, so we are going to review a few options!
snake_case->snake
lower_camel-> lowerCamel
upper_camel-> UpperCamel
all_caps->ALL_CAPS
data_clean<-clean_names(data, case ="snake")
Oh no! This made our variables that start with numbers change. However will we rename this….
The janitor package will uppend ‘x’ to any variables that start with a number. It’s the packages way of making it so you don’t need to wrap them in backticks.
test <- data_clean %>%rename_with(~str_remove(., "^character_"), starts_with("character_"))
The good news is that you can also fix this easily using the stringr package from tidyverse! We are going to just use the manual renaming but this is not something you always need to mess with if you’re trying to remove the same prefix or add the same prefix to multiple columns this is much easier.
We have a lot to cover today, so we won’t be going over pipes explicitly. You will notice in my code I frequently use ’%>%’. The shortcut on mac is shift+command+m.
When doing things using dplyr especially, reading from left to right can get very confusing! This is because sometimes the code is counter intuitive to what is happening. The pipe helps me be able to pass data through each argument and it also makes it much more legible.
In a sense %>% really means “and then”. I’ll be using it today because I believe it makes the task of data wrangling much easier, but it is not required. You could consider learning how to do this without the pipe for mini hacks.
data_clean %>%filter(species=="Human")
# A tibble: 35 × 15
name character_height character_mass hair_color skin_color eye_color
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 Luke Skywalk… 172 77 blond fair blue
2 Darth Vader 202 136 none white yellow
3 Leia Organa 150 49 brown light brown
4 Owen Lars 178 120 brown, gr… light blue
5 Beru Whitesu… 165 75 brown light blue
6 Biggs Darkli… 183 84 black light brown
7 Obi-Wan Keno… 182 77 auburn, w… fair blue-gray
8 Anakin Skywa… 188 84 blond fair blue
9 Wilhuff Tark… 180 NA auburn, g… fair blue
10 Han Solo 180 80 brown fair brown
# ℹ 25 more rows
# ℹ 9 more variables: birth_year <dbl>, sex <chr>, gender <chr>,
# homeworld <chr>, species <chr>, `1990` <dbl>, `1991` <dbl>, `1993` <dbl>,
# character_weapon <chr>
data_clean %>%filter(character_mass >50)
# A tibble: 46 × 15
name character_height character_mass hair_color skin_color eye_color
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 Luke Skywalk… 172 77 blond fair blue
2 C-3PO 167 75 <NA> gold yellow
3 Darth Vader 202 136 none white yellow
4 Owen Lars 178 120 brown, gr… light blue
5 Beru Whitesu… 165 75 brown light blue
6 Biggs Darkli… 183 84 black light brown
7 Obi-Wan Keno… 182 77 auburn, w… fair blue-gray
8 Anakin Skywa… 188 84 blond fair blue
9 Chewbacca 228 112 brown unknown blue
10 Han Solo 180 80 brown fair brown
# ℹ 36 more rows
# ℹ 9 more variables: birth_year <dbl>, sex <chr>, gender <chr>,
# homeworld <chr>, species <chr>, `1990` <dbl>, `1991` <dbl>, `1993` <dbl>,
# character_weapon <chr>
A cheat sheet on how to select what you want!
data_clean %>%filter(species !="Human")
# A tibble: 48 × 15
name character_height character_mass hair_color skin_color eye_color
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 C-3PO 167 75 <NA> gold yellow
2 R2-D2 96 32 <NA> white, bl… red
3 R5-D4 97 32 <NA> white, red red
4 Chewbacca 228 112 brown unknown blue
5 Greedo 173 74 <NA> green black
6 Jabba Desili… 175 1358 <NA> green-tan… orange
7 Yoda 66 17 white green brown
8 IG-88 200 140 none metal red
9 Bossk 190 113 none green red
10 Ackbar 180 83 none brown mot… orange
# ℹ 38 more rows
# ℹ 9 more variables: birth_year <dbl>, sex <chr>, gender <chr>,
# homeworld <chr>, species <chr>, `1990` <dbl>, `1991` <dbl>, `1993` <dbl>,
# character_weapon <chr>
Symbols
Meaning
‘>’
Greater than
‘>=’
Greater than or equal
‘<’
Less than
‘<=’
Less than or equal
‘!=’
Not equal
‘=’
Equal
‘&’
and
‘|’
or
‘!’
not
mutate ()
Adds new columns from current variables. This dataset doesn’t have a ton of numerical data, but we can still come up with some fun things!
data %>%mutate(name, power = (character_mass*character_height)/100)
# A tibble: 87 × 16
name character_height character_mass hair_color skin_color eye_color
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 Luke Skywalk… 172 77 blond fair blue
2 C-3PO 167 75 <NA> gold yellow
3 R2-D2 96 32 <NA> white, bl… red
4 Darth Vader 202 136 none white yellow
5 Leia Organa 150 49 brown light brown
6 Owen Lars 178 120 brown, gr… light blue
7 Beru Whitesu… 165 75 brown light blue
8 R5-D4 97 32 <NA> white, red red
9 Biggs Darkli… 183 84 black light brown
10 Obi-Wan Keno… 182 77 auburn, w… fair blue-gray
# ℹ 77 more rows
# ℹ 10 more variables: birth_year <dbl>, sex <chr>, gender <chr>,
# homeworld <chr>, species <chr>, `1990` <dbl>, `1991` <dbl>, `1993` <dbl>,
# character_weapon <chr>, power <dbl>
data %>%mutate(name, heightandmass = (character_mass + character_height))
# A tibble: 87 × 16
name character_height character_mass hair_color skin_color eye_color
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 Luke Skywalk… 172 77 blond fair blue
2 C-3PO 167 75 <NA> gold yellow
3 R2-D2 96 32 <NA> white, bl… red
4 Darth Vader 202 136 none white yellow
5 Leia Organa 150 49 brown light brown
6 Owen Lars 178 120 brown, gr… light blue
7 Beru Whitesu… 165 75 brown light blue
8 R5-D4 97 32 <NA> white, red red
9 Biggs Darkli… 183 84 black light brown
10 Obi-Wan Keno… 182 77 auburn, w… fair blue-gray
# ℹ 77 more rows
# ℹ 10 more variables: birth_year <dbl>, sex <chr>, gender <chr>,
# homeworld <chr>, species <chr>, `1990` <dbl>, `1991` <dbl>, `1993` <dbl>,
# character_weapon <chr>, heightandmass <dbl>
select ()
Essentially this allows you to only use the variables that you want to in a data frame. For me, this is also a convenient way to save a subset of a given data frame.
# A tibble: 87 × 3
name species character_height
<chr> <chr> <dbl>
1 Luke Skywalker Human 172
2 C-3PO Droid 167
3 R2-D2 Droid 96
4 Darth Vader Human 202
5 Leia Organa Human 150
6 Owen Lars Human 178
7 Beru Whitesun Lars Human 165
8 R5-D4 Droid 97
9 Biggs Darklighter Human 183
10 Obi-Wan Kenobi Human 182
# ℹ 77 more rows
Sometimes you just want one simple piece of information! If that’s the case, you want to probably use summarise. It will collapse everything itnto a single row, but it really needs to be paired with group_by to be most efficient.
Key summarize functions: mean, median, sum, sd, min, max
the main thing to remember is that it changes the unit of analysis! If you just run group_by it actually won’t look like it’s done anything until you add in another step.
This is why it worked for our weapons data frame but it doesn’t return anything useful here.
Wide vs. Long
Wide data looks like this. Each tree species only appears once and the measurements are sorted by site. The data is “wide format” because the table is literally wider. To add more data, you need to add more columns.
Long data looks like this. Each tree species appears multiple times and there are new columns for site and measurements.
Which format is better?
It really depends on what you are planning to do! The aforementioned “tidy data” is long format.
Part of the hardest part of pivoting from wide to long is visualizing what you want each of the columns and rows to be. You might even do something like this to help you figure it out:
Where are my columns going?
What will my rows be?
Is each cell an observation?
If you ever need a refresher on these visualizations, you can check out this page.
Pivot(long): easy mode
The most basic elements of pivoting to a longer format are designating which columns you will be putting into rows. To do this we use:
cols =
lists the columns to pivot
names_to
name of the created column that are currently column names
values_to
name of the created column from values that are in the wider format
data_long<- data %>% dplyr::select(name, "1990":"1993") %>%pivot_longer(cols=!name,names_to ="year",values_to ="kills")
Pivot(wide): easy mode
The most basic elements of pivoting to a wider format are designating where you will be taking the new column names and values from. To do this we use:
id_cols
lists columns that contain essential identifying information for each observation
names_from
the names of the column that will be spread out to become more columns
values_from
the name of the column that the cell values will come from
# A tibble: 11 × 3
name character_mass character_weapon
<chr> <dbl> <chr>
1 Luke Skywalker 77 Red Lightsaber
2 Biggs Darklighter 84 Red Lightsaber
3 Wedge Antilles 77 Red Lightsaber
4 Lobot 79 Red Lightsaber
5 Finis Valorum NA Red Lightsaber
6 Quarsh Panaka NA Red Lightsaber
7 Ben Quadinaros 65 Red Lightsaber
8 Plo Koon 80 Red Lightsaber
9 Dormé NA Red Lightsaber
10 Jocasta Nu NA Red Lightsaber
11 Sly Moore 48 Red Lightsaber
inner_join(data_sub1,data_sub2)
Joining with `by = join_by(name)`
# A tibble: 87 × 16
name character_mass character_weapon height mass hair_color skin_color
<chr> <dbl> <chr> <int> <dbl> <chr> <chr>
1 Luke Skyw… 77 Red Lightsaber 172 77 blond fair
2 C-3PO 75 Blue Sword 167 75 <NA> gold
3 R2-D2 32 Green Sword 96 32 <NA> white, bl…
4 Darth Vad… 136 Purple Lightsab… 202 136 none white
5 Leia Orga… 49 Purple Sword 150 49 brown light
6 Owen Lars 120 Red Sword 178 120 brown, gr… light
7 Beru Whit… 75 Blue Lightsaber 165 75 brown light
8 R5-D4 32 Green Lightsaber 97 32 <NA> white, red
9 Biggs Dar… 84 Red Lightsaber 183 84 black light
10 Obi-Wan K… 77 Blue Sword 182 77 auburn, w… fair
# ℹ 77 more rows
# ℹ 9 more variables: eye_color <chr>, birth_year <dbl>, sex <chr>,
# gender <chr>, homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
left_join(data_sub1,data_sub2)
Joining with `by = join_by(name)`
# A tibble: 87 × 16
name character_mass character_weapon height mass hair_color skin_color
<chr> <dbl> <chr> <int> <dbl> <chr> <chr>
1 Luke Skyw… 77 Red Lightsaber 172 77 blond fair
2 C-3PO 75 Blue Sword 167 75 <NA> gold
3 R2-D2 32 Green Sword 96 32 <NA> white, bl…
4 Darth Vad… 136 Purple Lightsab… 202 136 none white
5 Leia Orga… 49 Purple Sword 150 49 brown light
6 Owen Lars 120 Red Sword 178 120 brown, gr… light
7 Beru Whit… 75 Blue Lightsaber 165 75 brown light
8 R5-D4 32 Green Lightsaber 97 32 <NA> white, red
9 Biggs Dar… 84 Red Lightsaber 183 84 black light
10 Obi-Wan K… 77 Blue Sword 182 77 auburn, w… fair
# ℹ 77 more rows
# ℹ 9 more variables: eye_color <chr>, birth_year <dbl>, sex <chr>,
# gender <chr>, homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
left_join(data_sub2,data_sub1)
Joining with `by = join_by(name)`
# A tibble: 87 × 16
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 7 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>, character_mass <dbl>,
# character_weapon <chr>
full_join(data_sub1, data_sub2)
Joining with `by = join_by(name)`
# A tibble: 87 × 16
name character_mass character_weapon height mass hair_color skin_color
<chr> <dbl> <chr> <int> <dbl> <chr> <chr>
1 Luke Skyw… 77 Red Lightsaber 172 77 blond fair
2 C-3PO 75 Blue Sword 167 75 <NA> gold
3 R2-D2 32 Green Sword 96 32 <NA> white, bl…
4 Darth Vad… 136 Purple Lightsab… 202 136 none white
5 Leia Orga… 49 Purple Sword 150 49 brown light
6 Owen Lars 120 Red Sword 178 120 brown, gr… light
7 Beru Whit… 75 Blue Lightsaber 165 75 brown light
8 R5-D4 32 Green Lightsaber 97 32 <NA> white, red
9 Biggs Dar… 84 Red Lightsaber 183 84 black light
10 Obi-Wan K… 77 Blue Sword 182 77 auburn, w… fair
# ℹ 77 more rows
# ℹ 9 more variables: eye_color <chr>, birth_year <dbl>, sex <chr>,
# gender <chr>, homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Filtering Joins
Useful for subsetting data and getting rid of unwanted parts of a data frame.
They keep cases from left-hand data frame, so pay attention to which you put first.
semi_join()- returns all rows from left-hand data frame (x) where there are matching values in right-hand data frame (y). An inner join will return one row of x for each matching row of y, semi join will never duplicate rows of x.
anti_join()- returns all rows from x where there are not matching values from y. Keeps columns only from x.
This uses semi-join to show airports to show just destination airports.
anti_join(planes, flights)
Joining with `by = join_by(tailnum, year)`
# A tibble: 3,230 × 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N10156 2004 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
2 N102UW 1998 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
3 N103US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
4 N104UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
5 N10575 2002 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
6 N105UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
7 N107US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
8 N108UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
9 N109UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
10 N110UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
# ℹ 3,220 more rows
Mini-hacks
Use the dataset uploaded to canvas
Mini-hack 1: Tidy the data
This data set has a few problems. It looks like there are empty rows and columns. The variables are named a bunch of different ways. Please fix it so the data is “tidy” meaning:
no empty rows or columns
consistent naming (your choice)
a standardized format that is easy for someone else to see and use!
Mini-hack 2: Dplyr!
This data set honestly has way too much going on. All I want is:
species, sex, height, and mass
I want the average height and mass for each species/sex.
```{r}```
Mini-hack 3: Find the evil guys
Everyone knows the only way to tell a bad guy from a good guy is two things: weapon color. Unfortunately, it seems our weapon variable is both the weapon type and color. Can you split it for me and then group these by species? I need to know how many of each species have red or purple weapons (they’re the bad guys!).