Week 4: Data Wrangling

Data Science w Rob!

Author

Gretchen Nihill

Introduction

What is data wrangling? For people who use tidyverse, they might say that it’s the process of making their data “tidy”.

Tidy data is…

  1. Each variable is a column; each column is a variable.

  2. Each observation is a row; each row is an observation.

  3. Each value is a cell; each cell is a single value.

    Let’s go over dplyr, tidyverse. Mention steps we can take when designing the surveys to cut down on these!

In a sense, tidying data is just putting it into a standardized structure so that any analysts could come along and quickly make sense of it.

Today we will cover:

  1. Import, rename (readxl, janitor package)
  2. dplyr (tidyverse)
    • filter()
    • select()
    • mutate()
    • summarize ()
    • group_by()
  3. Pivot longer, pivot wider
  4. Merging data sets

Let’s do this!

Load packages

#install.packages("nycflights13")
library(nycflights13)
library(janitor)

Attaching package: 'janitor'
The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)

Load Data (without the headers!)

Sometimes you have a dataset that you didn’t put together. And sometimes that dataset is not tidy! Things like headers or extra spacing can be really annoying to deal with. Luckily, there are easy ways to handle it as you load the data into R!

data<-read_excel("starwars_classdata.xlsx")
New names:
• `` -> `...1`
• `` -> `...2`
• `` -> `...3`
• `` -> `...4`
• `` -> `...5`
• `` -> `...6`
• `` -> `...8`
• `` -> `...9`
• `` -> `...10`
• `` -> `...11`
• `` -> `...12`
• `` -> `...13`
• `` -> `...14`
• `` -> `...15`
head(data)
# A tibble: 6 × 15
  ...1       ...2  ...3  ...4  ...5  ...6  `star wars data!!!` ...8  ...9  ...10
  <chr>      <chr> <chr> <chr> <chr> <chr> <chr>               <chr> <chr> <chr>
1 name       char… char… hair… skin… eye_… birth_year          sex   gend… home…
2 Luke Skyw… 172   77    blond fair  blue  19                  male  masc… Tato…
3 C-3PO      167   75    <NA>  gold  yell… 112                 none  masc… Tato…
4 R2-D2      96    32    <NA>  whit… red   33                  none  masc… Naboo
5 Darth Vad… 202   136   none  white yell… 41.9                male  masc… Tato…
6 Leia Orga… 150   49    brown light brown 19                  fema… femi… Alde…
# ℹ 5 more variables: ...11 <chr>, ...12 <dbl>, ...13 <dbl>, ...14 <dbl>,
#   ...15 <chr>
data<-read_excel("starwars_classdata.xlsx", skip = 1)
head(data)
# A tibble: 6 × 15
  name           character_height character_mass hair_color skin_color eye_color
  <chr>                     <dbl>          <dbl> <chr>      <chr>      <chr>    
1 Luke Skywalker              172             77 blond      fair       blue     
2 C-3PO                       167             75 <NA>       gold       yellow   
3 R2-D2                        96             32 <NA>       white, bl… red      
4 Darth Vader                 202            136 none       white      yellow   
5 Leia Organa                 150             49 brown      light      brown    
6 Owen Lars                   178            120 brown, gr… light      blue     
# ℹ 9 more variables: birth_year <dbl>, sex <chr>, gender <chr>,
#   homeworld <chr>, species <chr>, `1990` <dbl>, `1991` <dbl>, `1993` <dbl>,
#   character_weapon <chr>

Renaming

For the examples below, I used the janitor package and dplyr. There are many different ways to rename in R, so we are going to review a few options!

  • snake_case->snake

  • lower_camel-> lowerCamel

  • upper_camel-> UpperCamel

  • all_caps->ALL_CAPS

data_clean<-clean_names(data, case = "snake")

Oh no! This made our variables that start with numbers change. However will we rename this….

The janitor package will uppend ‘x’ to any variables that start with a number. It’s the packages way of making it so you don’t need to wrap them in backticks.

data_clean <-rename(data_clean, 
                         '1990' = 'x1990',
                         '1991' = 'x1991',
                         '1993' = 'x1993') 
test <- data_clean %>% 
  rename_with(~ str_remove(., "^character_"), starts_with("character_"))

The good news is that you can also fix this easily using the stringr package from tidyverse! We are going to just use the manual renaming but this is not something you always need to mess with if you’re trying to remove the same prefix or add the same prefix to multiple columns this is much easier.

data_clean<-data_clean %>% remove_empty(which = c("rows", "cols"))

Dplyr

  • A super useful part of tidyverse!

  • A “grammar of data manipulation”

    • mainly verbs (think what do I want to do to my data?)
  • we won’t cover everything today- but here is a great cheat sheet:)

operation

what it does

filter () pick cases based on values
mutate() add new columns that are functions of existing ones (create new variables)
select () pick variables based on names
summarize() reduce multiple values down to a single summary (mean, median)
group_by() operations performed by group (rather than entire data set

Ok so it’s a “grammar” what are the rules?

  1. The first argument is a data frame.

  2. The second argument describes what to do with the data frame.

  3. The result is a new data frame

    data_clean_human<-filter(data_clean, species== "Human")

    A quick note about %>%

    We have a lot to cover today, so we won’t be going over pipes explicitly. You will notice in my code I frequently use ’%>%’. The shortcut on mac is shift+command+m.

    When doing things using dplyr especially, reading from left to right can get very confusing! This is because sometimes the code is counter intuitive to what is happening. The pipe helps me be able to pass data through each argument and it also makes it much more legible.

    In a sense %>% really means “and then”. I’ll be using it today because I believe it makes the task of data wrangling much easier, but it is not required. You could consider learning how to do this without the pipe for mini hacks.

data_clean %>% filter(species== "Human")
# A tibble: 35 × 15
   name          character_height character_mass hair_color skin_color eye_color
   <chr>                    <dbl>          <dbl> <chr>      <chr>      <chr>    
 1 Luke Skywalk…              172             77 blond      fair       blue     
 2 Darth Vader                202            136 none       white      yellow   
 3 Leia Organa                150             49 brown      light      brown    
 4 Owen Lars                  178            120 brown, gr… light      blue     
 5 Beru Whitesu…              165             75 brown      light      blue     
 6 Biggs Darkli…              183             84 black      light      brown    
 7 Obi-Wan Keno…              182             77 auburn, w… fair       blue-gray
 8 Anakin Skywa…              188             84 blond      fair       blue     
 9 Wilhuff Tark…              180             NA auburn, g… fair       blue     
10 Han Solo                   180             80 brown      fair       brown    
# ℹ 25 more rows
# ℹ 9 more variables: birth_year <dbl>, sex <chr>, gender <chr>,
#   homeworld <chr>, species <chr>, `1990` <dbl>, `1991` <dbl>, `1993` <dbl>,
#   character_weapon <chr>
data_clean %>% filter(character_mass > 50)
# A tibble: 46 × 15
   name          character_height character_mass hair_color skin_color eye_color
   <chr>                    <dbl>          <dbl> <chr>      <chr>      <chr>    
 1 Luke Skywalk…              172             77 blond      fair       blue     
 2 C-3PO                      167             75 <NA>       gold       yellow   
 3 Darth Vader                202            136 none       white      yellow   
 4 Owen Lars                  178            120 brown, gr… light      blue     
 5 Beru Whitesu…              165             75 brown      light      blue     
 6 Biggs Darkli…              183             84 black      light      brown    
 7 Obi-Wan Keno…              182             77 auburn, w… fair       blue-gray
 8 Anakin Skywa…              188             84 blond      fair       blue     
 9 Chewbacca                  228            112 brown      unknown    blue     
10 Han Solo                   180             80 brown      fair       brown    
# ℹ 36 more rows
# ℹ 9 more variables: birth_year <dbl>, sex <chr>, gender <chr>,
#   homeworld <chr>, species <chr>, `1990` <dbl>, `1991` <dbl>, `1993` <dbl>,
#   character_weapon <chr>

A cheat sheet on how to select what you want!

data_clean %>% filter(species != "Human")
# A tibble: 48 × 15
   name          character_height character_mass hair_color skin_color eye_color
   <chr>                    <dbl>          <dbl> <chr>      <chr>      <chr>    
 1 C-3PO                      167             75 <NA>       gold       yellow   
 2 R2-D2                       96             32 <NA>       white, bl… red      
 3 R5-D4                       97             32 <NA>       white, red red      
 4 Chewbacca                  228            112 brown      unknown    blue     
 5 Greedo                     173             74 <NA>       green      black    
 6 Jabba Desili…              175           1358 <NA>       green-tan… orange   
 7 Yoda                        66             17 white      green      brown    
 8 IG-88                      200            140 none       metal      red      
 9 Bossk                      190            113 none       green      red      
10 Ackbar                     180             83 none       brown mot… orange   
# ℹ 38 more rows
# ℹ 9 more variables: birth_year <dbl>, sex <chr>, gender <chr>,
#   homeworld <chr>, species <chr>, `1990` <dbl>, `1991` <dbl>, `1993` <dbl>,
#   character_weapon <chr>
Symbols Meaning
‘>’ Greater than
‘>=’ Greater than or equal
‘<’ Less than
‘<=’ Less than or equal
‘!=’ Not equal
‘=’ Equal
‘&’ and
‘|’ or
‘!’ not

mutate ()

Adds new columns from current variables. This dataset doesn’t have a ton of numerical data, but we can still come up with some fun things!

data %>% mutate(name, power = (character_mass*character_height)/100)
# A tibble: 87 × 16
   name          character_height character_mass hair_color skin_color eye_color
   <chr>                    <dbl>          <dbl> <chr>      <chr>      <chr>    
 1 Luke Skywalk…              172             77 blond      fair       blue     
 2 C-3PO                      167             75 <NA>       gold       yellow   
 3 R2-D2                       96             32 <NA>       white, bl… red      
 4 Darth Vader                202            136 none       white      yellow   
 5 Leia Organa                150             49 brown      light      brown    
 6 Owen Lars                  178            120 brown, gr… light      blue     
 7 Beru Whitesu…              165             75 brown      light      blue     
 8 R5-D4                       97             32 <NA>       white, red red      
 9 Biggs Darkli…              183             84 black      light      brown    
10 Obi-Wan Keno…              182             77 auburn, w… fair       blue-gray
# ℹ 77 more rows
# ℹ 10 more variables: birth_year <dbl>, sex <chr>, gender <chr>,
#   homeworld <chr>, species <chr>, `1990` <dbl>, `1991` <dbl>, `1993` <dbl>,
#   character_weapon <chr>, power <dbl>
data %>% mutate(name, heightandmass = (character_mass + character_height))
# A tibble: 87 × 16
   name          character_height character_mass hair_color skin_color eye_color
   <chr>                    <dbl>          <dbl> <chr>      <chr>      <chr>    
 1 Luke Skywalk…              172             77 blond      fair       blue     
 2 C-3PO                      167             75 <NA>       gold       yellow   
 3 R2-D2                       96             32 <NA>       white, bl… red      
 4 Darth Vader                202            136 none       white      yellow   
 5 Leia Organa                150             49 brown      light      brown    
 6 Owen Lars                  178            120 brown, gr… light      blue     
 7 Beru Whitesu…              165             75 brown      light      blue     
 8 R5-D4                       97             32 <NA>       white, red red      
 9 Biggs Darkli…              183             84 black      light      brown    
10 Obi-Wan Keno…              182             77 auburn, w… fair       blue-gray
# ℹ 77 more rows
# ℹ 10 more variables: birth_year <dbl>, sex <chr>, gender <chr>,
#   homeworld <chr>, species <chr>, `1990` <dbl>, `1991` <dbl>, `1993` <dbl>,
#   character_weapon <chr>, heightandmass <dbl>

select ()

Essentially this allows you to only use the variables that you want to in a data frame. For me, this is also a convenient way to save a subset of a given data frame.

data_clean %>% dplyr::select(name, species, character_height)
# A tibble: 87 × 3
   name               species character_height
   <chr>              <chr>              <dbl>
 1 Luke Skywalker     Human                172
 2 C-3PO              Droid                167
 3 R2-D2              Droid                 96
 4 Darth Vader        Human                202
 5 Leia Organa        Human                150
 6 Owen Lars          Human                178
 7 Beru Whitesun Lars Human                165
 8 R5-D4              Droid                 97
 9 Biggs Darklighter  Human                183
10 Obi-Wan Kenobi     Human                182
# ℹ 77 more rows
data_clean %>% dplyr::select(name, starts_with ("19"))
# A tibble: 87 × 4
   name               `1990` `1991` `1993`
   <chr>               <dbl>  <dbl>  <dbl>
 1 Luke Skywalker          9      5      6
 2 C-3PO                   7      4      4
 3 R2-D2                   5      3      3
 4 Darth Vader             3      2      2
 5 Leia Organa             9      1      1
 6 Owen Lars               7      5      5
 7 Beru Whitesun Lars      5      4      4
 8 R5-D4                   3      3      3
 9 Biggs Darklighter       9      2      2
10 Obi-Wan Kenobi          7      1      1
# ℹ 77 more rows

summarise ()

Sometimes you just want one simple piece of information! If that’s the case, you want to probably use summarise. It will collapse everything itnto a single row, but it really needs to be paired with group_by to be most efficient.

Key summarize functions: mean, median, sum, sd, min, max

weapons<-data_clean %>% summarise(n
                   =n())
weapons<-data_clean %>% group_by(character_weapon) %>% 
               summarise(n
                   =n())
weapons<-data_clean %>% group_by(species) %>% 
               summarise(Avg_height = mean(character_height))

group_by()

the main thing to remember is that it changes the unit of analysis! If you just run group_by it actually won’t look like it’s done anything until you add in another step.

weapons<- data_clean %>% group_by (species)
weapons<- weapons %>% summarise (Avg_height = mean(character_height))

If I do this after I’ve already grouped by species, then it will group it by species even though they’re separate lines.

data_clean2<- data_clean  %>% summarise (Avg_height = mean(character_height))

This is why it worked for our weapons data frame but it doesn’t return anything useful here.

Wide vs. Long

Wide data looks like this. Each tree species only appears once and the measurements are sorted by site. The data is “wide format” because the table is literally wider. To add more data, you need to add more columns.

Long data looks like this. Each tree species appears multiple times and there are new columns for site and measurements.

Which format is better?

It really depends on what you are planning to do! The aforementioned “tidy data” is long format.

Part of the hardest part of pivoting from wide to long is visualizing what you want each of the columns and rows to be. You might even do something like this to help you figure it out:

  • Where are my columns going?

  • What will my rows be?

  • Is each cell an observation?

If you ever need a refresher on these visualizations, you can check out this page.

Pivot(long): easy mode

The most basic elements of pivoting to a longer format are designating which columns you will be putting into rows. To do this we use:

  • cols =

    • lists the columns to pivot
  • names_to

    • name of the created column that are currently column names
  • values_to

    • name of the created column from values that are in the wider format
data_long<- data %>% dplyr::select(name, "1990":"1993") %>% 
  pivot_longer(
    cols=!name,
  names_to = "year",
  values_to = "kills"
)

Pivot(wide): easy mode

The most basic elements of pivoting to a wider format are designating where you will be taking the new column names and values from. To do this we use:

  • id_cols

    • lists columns that contain essential identifying information for each observation
  • names_from

    • the names of the column that will be spread out to become more columns
  • values_from

    • the name of the column that the cell values will come from
data_wide<- data_long %>% pivot_wider(
  names_from = year,
  values_from = kills
)

Let’s do something harder!

data_long<- data_clean %>% 
dplyr::select(name, "1990":"1993",character_weapon) %>% 
  pivot_longer(cols = !name,
               names_to = "Names",
               values_to = "Values",
               values_transform = list(Values = as.character)) %>% 
  # separate(Names, into = c("names1", "names2"), sep = "_" ) %>% 
  group_by(name) %>% 
 mutate(Alignment = case_when(("Red Lightsaber" %in% Values) ~ "Evil", 
                              ("Red Sword" %in% Values) ~ "Evil", 
                              ("Blue Lightsaber" %in% Values) ~ "Good", 
                              ("Blue Sword" %in% Values) ~ "Good", 
                              ("Green Lightsaber" %in% Values) ~ "Good", 
                              ("Green Sword" %in% Values) ~ "Good", 
                              ("Purple Lightsaber" %in% Values) ~ "Evil", 
                              ("Purple Sword" %in% Values) ~ "Evil"))
 data_wider<- data_long %>%  ungroup() %>% 
  pivot_wider(id_cols = name,
              names_from = "Names",
              values_from = "Values")
data_wider2<- data_long %>%  ungroup() %>% 
  pivot_wider(id_cols = name,
              names_from = "Names",
              values_from = c(Values, Alignment))
 data_wider3<- data_long %>%  ungroup() %>% 
  pivot_wider(id_cols = c(name ,Alignment),
              names_from = "Names",
              values_from = "Values")

Merging Data Sets

When merging data sets I tend to use Mutating joins. These combine variable from two data.frames:

  • inner_join()- only keeps observations from x that have a matching key in y

    • unmatched rows in both inputs are not included, very easy to lose observations
  • left_join()- keep all observations in x

  • right_join()- keep all observations in y

  • full_join()- keeps all observations in x and y.

First let’s make two subsets

data_sub1<-data_clean %>% dplyr::select(name, character_mass, character_weapon)
data_sub2<-starwars 

Recombine

inner_join (data_sub1,data_sub2)
Joining with `by = join_by(name)`
# A tibble: 87 × 16
   name       character_mass character_weapon height  mass hair_color skin_color
   <chr>               <dbl> <chr>             <int> <dbl> <chr>      <chr>     
 1 Luke Skyw…             77 Red Lightsaber      172    77 blond      fair      
 2 C-3PO                  75 Blue Sword          167    75 <NA>       gold      
 3 R2-D2                  32 Green Sword          96    32 <NA>       white, bl…
 4 Darth Vad…            136 Purple Lightsab…    202   136 none       white     
 5 Leia Orga…             49 Purple Sword        150    49 brown      light     
 6 Owen Lars             120 Red Sword           178   120 brown, gr… light     
 7 Beru Whit…             75 Blue Lightsaber     165    75 brown      light     
 8 R5-D4                  32 Green Lightsaber     97    32 <NA>       white, red
 9 Biggs Dar…             84 Red Lightsaber      183    84 black      light     
10 Obi-Wan K…             77 Blue Sword          182    77 auburn, w… fair      
# ℹ 77 more rows
# ℹ 9 more variables: eye_color <chr>, birth_year <dbl>, sex <chr>,
#   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
data_sub1 %>% filter (character_weapon == 'Red Lightsaber')
# A tibble: 11 × 3
   name              character_mass character_weapon
   <chr>                      <dbl> <chr>           
 1 Luke Skywalker                77 Red Lightsaber  
 2 Biggs Darklighter             84 Red Lightsaber  
 3 Wedge Antilles                77 Red Lightsaber  
 4 Lobot                         79 Red Lightsaber  
 5 Finis Valorum                 NA Red Lightsaber  
 6 Quarsh Panaka                 NA Red Lightsaber  
 7 Ben Quadinaros                65 Red Lightsaber  
 8 Plo Koon                      80 Red Lightsaber  
 9 Dormé                         NA Red Lightsaber  
10 Jocasta Nu                    NA Red Lightsaber  
11 Sly Moore                     48 Red Lightsaber  
inner_join(data_sub1,data_sub2)
Joining with `by = join_by(name)`
# A tibble: 87 × 16
   name       character_mass character_weapon height  mass hair_color skin_color
   <chr>               <dbl> <chr>             <int> <dbl> <chr>      <chr>     
 1 Luke Skyw…             77 Red Lightsaber      172    77 blond      fair      
 2 C-3PO                  75 Blue Sword          167    75 <NA>       gold      
 3 R2-D2                  32 Green Sword          96    32 <NA>       white, bl…
 4 Darth Vad…            136 Purple Lightsab…    202   136 none       white     
 5 Leia Orga…             49 Purple Sword        150    49 brown      light     
 6 Owen Lars             120 Red Sword           178   120 brown, gr… light     
 7 Beru Whit…             75 Blue Lightsaber     165    75 brown      light     
 8 R5-D4                  32 Green Lightsaber     97    32 <NA>       white, red
 9 Biggs Dar…             84 Red Lightsaber      183    84 black      light     
10 Obi-Wan K…             77 Blue Sword          182    77 auburn, w… fair      
# ℹ 77 more rows
# ℹ 9 more variables: eye_color <chr>, birth_year <dbl>, sex <chr>,
#   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
left_join(data_sub1,data_sub2)
Joining with `by = join_by(name)`
# A tibble: 87 × 16
   name       character_mass character_weapon height  mass hair_color skin_color
   <chr>               <dbl> <chr>             <int> <dbl> <chr>      <chr>     
 1 Luke Skyw…             77 Red Lightsaber      172    77 blond      fair      
 2 C-3PO                  75 Blue Sword          167    75 <NA>       gold      
 3 R2-D2                  32 Green Sword          96    32 <NA>       white, bl…
 4 Darth Vad…            136 Purple Lightsab…    202   136 none       white     
 5 Leia Orga…             49 Purple Sword        150    49 brown      light     
 6 Owen Lars             120 Red Sword           178   120 brown, gr… light     
 7 Beru Whit…             75 Blue Lightsaber     165    75 brown      light     
 8 R5-D4                  32 Green Lightsaber     97    32 <NA>       white, red
 9 Biggs Dar…             84 Red Lightsaber      183    84 black      light     
10 Obi-Wan K…             77 Blue Sword          182    77 auburn, w… fair      
# ℹ 77 more rows
# ℹ 9 more variables: eye_color <chr>, birth_year <dbl>, sex <chr>,
#   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
left_join(data_sub2,data_sub1)
Joining with `by = join_by(name)`
# A tibble: 87 × 16
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 7 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>, character_mass <dbl>,
#   character_weapon <chr>
full_join(data_sub1, data_sub2)
Joining with `by = join_by(name)`
# A tibble: 87 × 16
   name       character_mass character_weapon height  mass hair_color skin_color
   <chr>               <dbl> <chr>             <int> <dbl> <chr>      <chr>     
 1 Luke Skyw…             77 Red Lightsaber      172    77 blond      fair      
 2 C-3PO                  75 Blue Sword          167    75 <NA>       gold      
 3 R2-D2                  32 Green Sword          96    32 <NA>       white, bl…
 4 Darth Vad…            136 Purple Lightsab…    202   136 none       white     
 5 Leia Orga…             49 Purple Sword        150    49 brown      light     
 6 Owen Lars             120 Red Sword           178   120 brown, gr… light     
 7 Beru Whit…             75 Blue Lightsaber     165    75 brown      light     
 8 R5-D4                  32 Green Lightsaber     97    32 <NA>       white, red
 9 Biggs Dar…             84 Red Lightsaber      183    84 black      light     
10 Obi-Wan K…             77 Blue Sword          182    77 auburn, w… fair      
# ℹ 77 more rows
# ℹ 9 more variables: eye_color <chr>, birth_year <dbl>, sex <chr>,
#   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Filtering Joins

Useful for subsetting data and getting rid of unwanted parts of a data frame.

They keep cases from left-hand data frame, so pay attention to which you put first.

  • semi_join()- returns all rows from left-hand data frame (x) where there are matching values in right-hand data frame (y). An inner join will return one row of x for each matching row of y, semi join will never duplicate rows of x.

  • anti_join()- returns all rows from x where there are not matching values from y. Keeps columns only from x.

library(nycflights13)
Weather<-weather
planes<-planes
airlines<-airlines

I really recommend visiting this site if you need extra help on joining for the mini hacks.

flights2<- flights %>%  dplyr::select(year, origin, dest, carrier,tailnum)
flights2 %>% filter(origin == 'LGA') %>% left_join(planes %>% dplyr::select(tailnum, seats))
Joining with `by = join_by(tailnum)`
# A tibble: 104,662 × 6
    year origin dest  carrier tailnum seats
   <int> <chr>  <chr> <chr>   <chr>   <int>
 1  2013 LGA    IAH   UA      N24211    149
 2  2013 LGA    ATL   DL      N668DN    178
 3  2013 LGA    IAD   EV      N829AS     55
 4  2013 LGA    ORD   AA      N3ALAA     NA
 5  2013 LGA    DFW   AA      N3DUAA     NA
 6  2013 LGA    FLL   B6      N595JB    200
 7  2013 LGA    ATL   MQ      N542MQ     NA
 8  2013 LGA    MSP   DL      N971DL    142
 9  2013 LGA    DTW   MQ      N730MQ     NA
10  2013 LGA    MIA   AA      N3EMAA     NA
# ℹ 104,652 more rows
airports<-airports %>% semi_join(flights2, join_by(faa == dest))

This uses semi-join to show airports to show just destination airports.

anti_join(planes, flights)
Joining with `by = join_by(tailnum, year)`
# A tibble: 3,230 × 9
   tailnum  year type              manufacturer model engines seats speed engine
   <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
 1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
# ℹ 3,220 more rows

Mini-hacks

Use the dataset uploaded to canvas

Mini-hack 1: Tidy the data

This data set has a few problems. It looks like there are empty rows and columns. The variables are named a bunch of different ways. Please fix it so the data is “tidy” meaning:

  1. no empty rows or columns

  2. consistent naming (your choice)

  3. a standardized format that is easy for someone else to see and use!

Mini-hack 2: Dplyr!

This data set honestly has way too much going on. All I want is:

  • species, sex, height, and mass

  • I want the average height and mass for each species/sex.

```{r}

```

Mini-hack 3: Find the evil guys

Everyone knows the only way to tell a bad guy from a good guy is two things: weapon color. Unfortunately, it seems our weapon variable is both the weapon type and color. Can you split it for me and then group these by species? I need to know how many of each species have red or purple weapons (they’re the bad guys!).