Overview

Today we will cover:

1. Cleaning your data (janitor or other standard naming practices)

2. dplyr (part of the tidyverse)

3. tidyr

4. Merging Data Sets

5. Piping

6. HACK! (Separate File)

(Setting up)

# Clear workspace
rm(list = ls()) 

# Set working directory
setwd("~/Desktop/Data Science/W3_Wrangling")

# Knit options
chooseCRANmirror(graphics=FALSE, ind=1)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)

# Install and load required packages. You can install the complete tidyverse with a single line of code:

#install.packages("tidyverse")

(when installing tidyverse, note the conflicts message. This tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag().

#install.packages("Lahman")  #This is Lahman's baseball dataset
#install.packages("janitor")
#install.packages("readxl")

Clean your data

It is helpful standardize names for all your variables, and remove any empty rows and columns. You can do this using the “janitor” library. This is especially useful when merging datasets that contain similar variables!

Most useful janitor commands for all datasets:

  • clean_names() - standardizes all names according to snake_case

  • remove_empty_rows()

  • remove_empty_cols()

Let’s try it:

library(readxl)
netflix <- read_excel("Netflix.xlsx") #import data set
netflix
## # A tibble: 11 x 4
##    `Hours of Netflix`  Pets   GPA `empty column`
##                 <dbl> <dbl> <dbl>          <lgl>
##  1                  5     1  3.00             NA
##  2                  8     3  2.50             NA
##  3                 12     0  2.00             NA
##  4                  4     2  3.50             NA
##  5                  1     1  3.80             NA
##  6                  3     1  3.80             NA
##  7                  3     0  4.00             NA
##  8                 NA    NA    NA             NA
##  9                 NA    NA    NA             NA
## 10                  4     0  2.50             NA
## 11                  4     1  3.25             NA
library(janitor)
clean_netflix <- clean_names(netflix) 
clean_netflix <- remove_empty_rows(clean_netflix)
clean_netflix <- remove_empty_cols(clean_netflix)

clean_netflix
## # A tibble: 9 x 3
##   hours_of_netflix  pets   gpa
##              <dbl> <dbl> <dbl>
## 1                5     1  3.00
## 2                8     3  2.50
## 3               12     0  2.00
## 4                4     2  3.50
## 5                1     1  3.80
## 6                3     1  3.80
## 7                3     0  4.00
## 8                4     0  2.50
## 9                4     1  3.25

You can also create a new data frame with ALL missing values removed with: newdata <- na.omit(mydata)

dplyr

There are five key dplyr functions that allow you to solve the vast majority of data manipulation challenges:

I want to… Function
1. Pick observations by their values filter()
2. Pick variables by their names select()
3. Reorder rows arrange()
4. Create new variables from existing variables mutate()
5. Collapse many values down to a single summary summarise()

These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group

For all of these functions:

  1. The first argument is a data frame

  2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

  3. The result is a new data frame.

1. Filter

filter()allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. For example, let’s pull out just the most recent batting data (2016), then just look at the Baltimore Orioles:

library(Lahman)
library(tidyverse)
## ── Attaching packages ─── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.3.4     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## ── Conflicts ────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
data <- filter(Batting, yearID == 2016)
head(data)
##    playerID yearID stint teamID lgID   G  AB  R   H X2B X3B HR RBI SB CS
## 1  abadfe01   2016     1    MIN   AL  39   1  0   0   0   0  0   0  0  0
## 2  abadfe01   2016     2    BOS   AL  18   0  0   0   0   0  0   0  0  0
## 3 abreujo02   2016     1    CHA   AL 159 624 67 183  32   1 25 100  0  2
## 4 achteaj01   2016     1    LAA   AL  27   0  0   0   0   0  0   0  0  0
## 5 ackledu01   2016     1    NYA   AL  28  61  6   9   0   0  0   4  0  0
## 6 adamecr01   2016     1    COL   NL 121 225 25  49   7   3  2  17  2  3
##   BB  SO IBB HBP SH SF GIDP
## 1  0   1   0   0  0  0    0
## 2  0   0   0   0  0  0    0
## 3 47 125   7  15  0  9   21
## 4  0   0   0   0  0  0    0
## 5  8   9   0   0  0  1    0
## 6 24  47   0   4  3  0    5
os_data <- filter(data, teamID == "BAL")

To use filtering effectively, you have to know how to select the observations that you want using the comparison operators. R provides the standard suite: >, >=, <, <=, != (not equal), and == (equal).

Symbols Meaning
‘>’ Greater than
‘>=’ Greater than or equal
‘<’ Less than
‘<=’ Less than or equal
‘!=’ Not equal
‘=’ Equal
‘&’ and
‘|’ or
‘!’ not

So, you can get rid of all the pitchers and players who were only briefly on the team by removing everyone with a low number of ABs:

os_data <- filter(os_data, !(AB == 0)) #Remove anyone who's AB = 0
os_data <- filter(os_data, !(AB < 100)) #Remove anyone who's AB < 100

Figure 5.1 in Wickham’s online book shows the complete set of Boolean operations.

Logical.operators

Logical.operators

2. Select

Now we want to narrow down the data to just the variables we are interested in. There are several ways of doing this:

# Select columns by name - new data frame includes only the named variables
hr <- select(os_data, playerID, AB, HR)

# Select all columns between  (inclusive)
playerID_to_RBI <- select(os_data, playerID:RBI)

# Select all columns except those from year to "lgID" (inclusive)
all_except <- select(os_data, -(yearID:lgID))

There are a number of helper functions you can use within select():

  • starts_with(“abc”): matches names that begin with “abc”.

  • ends_with(“xyz”): matches names that end with “xyz”.

  • contains(“ijk”): matches names that contain “ijk”.

See ?select for more details.

3. Arrange

arrange() simply re-arranges the order of rows according to a certain column. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.

Let’s go back to the data that includes just 3 variables - player ID, AB, and HR. We named this data frame “hr”

#Re-arragne hr data set so rows go from lowest to highest hit rate. This is the default.
hr_arranged <- arrange(hr, HR, AB)

#Use desc() to re-order by a column in descending order:
hr_desc <- arrange(hr, desc(HR))
hr_desc
##     playerID  AB HR
## 1  trumbma01 613 47
## 2  davisch02 566 38
## 3  machama01 640 37
## 4  jonesad01 619 29
## 5  schoojo01 615 25
## 6  alvarpe01 337 22
## 7  wietema01 423 17
## 8  hardyjj01 405  9
## 9    kimhy01 305  6
## 10 reimono01 203  6
## 11 rickajo01 257  5
## 12 flahery01 157  3
## 13 josepca01 132  0

4. Mutate

Mutate ads new columns as a function of current variables. Let’s get our own estimate of batting average (hit rate / at bats) and add it to our data frame.

hr_desc <- mutate(hr_desc, BA = HR/AB)
hr_desc
##     playerID  AB HR         BA
## 1  trumbma01 613 47 0.07667210
## 2  davisch02 566 38 0.06713781
## 3  machama01 640 37 0.05781250
## 4  jonesad01 619 29 0.04684976
## 5  schoojo01 615 25 0.04065041
## 6  alvarpe01 337 22 0.06528190
## 7  wietema01 423 17 0.04018913
## 8  hardyjj01 405  9 0.02222222
## 9    kimhy01 305  6 0.01967213
## 10 reimono01 203  6 0.02955665
## 11 rickajo01 257  5 0.01945525
## 12 flahery01 157  3 0.01910828
## 13 josepca01 132  0 0.00000000
#If you only want to keep the newly created variable, use transmute.
hr_desc <- transmute(hr_desc, playerID = playerID, BA = HR/AB)
hr_desc
##     playerID         BA
## 1  trumbma01 0.07667210
## 2  davisch02 0.06713781
## 3  machama01 0.05781250
## 4  jonesad01 0.04684976
## 5  schoojo01 0.04065041
## 6  alvarpe01 0.06528190
## 7  wietema01 0.04018913
## 8  hardyjj01 0.02222222
## 9    kimhy01 0.01967213
## 10 reimono01 0.02955665
## 11 rickajo01 0.01945525
## 12 flahery01 0.01910828
## 13 josepca01 0.00000000

In mutate, you can use basic arithmatic, or use functions such as: * sum() x + sum(x) * mean() y - mean(y) * log() or log10()

5. Summarize

The last key verb is summarise(). It collapses a data frame to a single row. It’s not really useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”.

Useful summarize functions: * mean() * median() * sum() * sd() * min() or max()

Adding na.rm removes any missing values prior to computation

#Let's go back to our original 2016 Batting data. We named this "data". 

#This is saying "Just give me the sum of hit rates per team":
by_teams <- group_by(data, teamID)
sum_hr_by_team <- summarise(by_teams, HR = sum(HR))

#Now, how about the average run's per team:
avgr_by_team <- summarise(by_teams, AvgR = mean(R))

#If you need to remove grouping, and return to operations on ungrouped data, use ungroup().

tidyr

When your data is tidy:

  1. Each column is a variable
  2. Each row is an observation
  3. Each value has it’s own cell

Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.

(when installing tidyverse, note the conflicts message. This tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag().

There are two main advantages:

  1. Generally, if you have one consistent data structure, it’s easy to work with several data sets.

  2. Specifically, most built-in R functions work with vectors of values. All columns become vectors of values, which makes it easier to put our variables into functions.

dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data.

Spreading and Gathering

For most real data, you’ll need to do some tidying. The first step is always to figure out what the variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data. The second step is to resolve one of two common problems:

  1. One variable might be spread across multiple columns.

  2. One observation might be scattered across multiple rows.

To fix these problems, you’ll need the two most important functions in tidyr: gather() and spread().

In the following examples, we will use UN Migration Data from 2015. It comes in an “untidy” excel file.

The file contains a bunch of worksheets to include different years and data broken down by total / male / female. But lets just look at ‘Table 6’ which contains the total migrants data for 2015 for this post. We need to skip the first 15 rows, which is just the title, then the next 8 rows which contains “region” info (we will just look at country-country migrations). Here is the code for that:

library(readxl)
UN_MigrantStockByOriginAndDestination_2015 <- read_excel("UN_MigrantStockByOriginAndDestination_2015.xlsx", 
    sheet = "Table 16", skip = 15)
#View(UN_MigrantStockByOriginAndDestination_2015)

I’m a fan of drop-downs, so if you would like to use the dropdown to import this data, use the following:

File -> Import Dataset

  • Select ‘Table 16’ from Sheet dropdown

  • Enter “15” in the “Skip” field.

#Notice that the first 8 rows are just regions. Conveniently, column X__5 is empty for these rows, so if we just want country-country, we can remove these rows with:
UN_filter <- filter(UN_MigrantStockByOriginAndDestination_2015, !is.na(X__5))

#Remove unnecessary columns. We just want migrants from countries to countries. Use ‘-’ to delete columns and using ‘start_with’ function inside ‘select’ command to delete multiple columns whose names matche the text pattern of “Other”.
UN_select <- select(UN_filter, -X__1, -X__3, -X__5, -Total, -starts_with("Other"))

#rename columns
UN_renamed <- rename(UN_select, destination_country = X__2, country_code = X__4)

head(UN_renamed)
## # A tibble: 6 x 234
##   destination_country country_code Afghanistan Albania Algeria
##                 <chr>        <dbl>       <dbl>   <dbl>   <dbl>
## 1             Burundi          108          NA      NA      NA
## 2             Comoros          174          NA      NA      NA
## 3            Djibouti          262          NA      NA      NA
## 4             Eritrea          232          NA      NA      NA
## 5            Ethiopia          231          NA      NA      NA
## 6               Kenya          404          NA      NA      NA
## # ... with 229 more variables: `American Samoa` <dbl>, Andorra <dbl>,
## #   Angola <dbl>, Anguilla <dbl>, `Antigua and Barbuda` <dbl>,
## #   Argentina <dbl>, Armenia <dbl>, Aruba <dbl>, Australia <dbl>,
## #   Austria <dbl>, Azerbaijan <dbl>, Bahamas <dbl>, Bahrain <dbl>,
## #   Bangladesh <dbl>, Barbados <dbl>, Belarus <dbl>, Belgium <dbl>,
## #   Belize <dbl>, Benin <dbl>, Bermuda <dbl>, Bhutan <dbl>, `Bolivia
## #   (Plurinational State of)` <dbl>, `Bonaire, Sint Eustatius and
## #   Saba` <dbl>, `Bosnia and Herzegovina` <dbl>, Botswana <dbl>,
## #   Brazil <dbl>, `British Virgin Islands` <dbl>, `Brunei
## #   Darussalam` <dbl>, Bulgaria <dbl>, `Burkina Faso` <dbl>,
## #   Burundi <dbl>, `Cabo Verde` <dbl>, Cambodia <dbl>, Cameroon <dbl>,
## #   Canada <dbl>, `Cayman Islands` <dbl>, `Central African
## #   Republic` <dbl>, Chad <dbl>, `Channel Islands` <dbl>, Chile <dbl>,
## #   China <dbl>, `China, Hong Kong Special Administrative Region` <dbl>,
## #   `China, Macao Special Administrative Region` <dbl>, Colombia <dbl>,
## #   Comoros <dbl>, Congo <dbl>, `Cook Islands` <dbl>, `Costa Rica` <dbl>,
## #   `Côte d'Ivoire` <dbl>, Croatia <dbl>, Cuba <dbl>, Curaçao <dbl>,
## #   Cyprus <dbl>, `Czech Republic` <dbl>, `Democratic People's Republic of
## #   Korea` <dbl>, `Democratic Republic of the Congo` <dbl>, Denmark <dbl>,
## #   Djibouti <dbl>, Dominica <dbl>, `Dominican Republic` <dbl>,
## #   Ecuador <dbl>, Egypt <dbl>, `El Salvador` <dbl>, `Equatorial
## #   Guinea` <dbl>, Eritrea <dbl>, Estonia <dbl>, Ethiopia <dbl>, `Faeroe
## #   Islands` <dbl>, `Falkland Islands (Malvinas)` <dbl>, Fiji <dbl>,
## #   Finland <dbl>, France <dbl>, `French Guiana` <dbl>, `French
## #   Polynesia` <dbl>, Gabon <dbl>, Gambia <dbl>, Georgia <dbl>,
## #   Germany <dbl>, Ghana <dbl>, Gibraltar <dbl>, Greece <dbl>,
## #   Greenland <dbl>, Grenada <dbl>, Guadeloupe <dbl>, Guam <dbl>,
## #   Guatemala <dbl>, Guinea <dbl>, `Guinea-Bissau` <dbl>, Guyana <dbl>,
## #   Haiti <dbl>, `Holy See` <dbl>, Honduras <dbl>, Hungary <dbl>,
## #   Iceland <dbl>, India <dbl>, Indonesia <dbl>, `Iran (Islamic Republic
## #   of)` <dbl>, Iraq <dbl>, Ireland <dbl>, `Isle of Man` <dbl>, ...

Gather

Now, we have origin countries listed separate columns. If we wanted a single “origin country” variable, we need to gather these columns into a new pair of variables. To describe that operation we need three parameters:

  1. The set of columns that represent values, not variables. In this example, those are the columns Afghanistan:Zimbabwe.

  2. The key = the name of the variable whose values form the column names. In this exaple, let’s call this variable “origin countries”.

  3. The value = The name of the variable whose values are spread over the cells. Here, it is number of migrants.

In gather() you can select columns the same way you would with select() command (e.g.,starts_with(), contains() )

In the final result, the gathered columns are dropped, and we get new key and value columns. Here, we can see the estimated number of people who migrated for each pair of the countries.

#Note - there are many pairs of countries people didn't migrate from or to, so we can get rid of those by inserting "na.rm=True" at the end:
UN_tidy <- gather(UN_renamed, origin_country, migrants, Afghanistan:Zimbabwe, na.rm=TRUE)
head(UN_tidy)
## # A tibble: 6 x 4
##   destination_country country_code origin_country migrants
##                 <chr>        <dbl>          <chr>    <dbl>
## 1               Egypt          818    Afghanistan      235
## 2               Libya          434    Afghanistan      320
## 3             Namibia          516    Afghanistan       39
## 4        South Africa          710    Afghanistan       83
## 5          Tajikistan          762    Afghanistan     7587
## 6            Malaysia          458    Afghanistan      498

Spread

Spreading is the opposite of gathering. You use it when an observation is scattered across multiple rows.

For spread() you need 2 parameters:

  1. The key = column that contains variable names. For us, this is conveniently called “Key”

  2. The value = The name of the variable whose values are spread over the cells. Here, it “Value”.

#read in the data
library(readxl)
spreading_data <- read_excel("spreading_data.xlsx")
spreading_data
## # A tibble: 12 x 4
##        Country  Year        Key      Value
##          <chr> <dbl>      <chr>      <dbl>
##  1 Afghanistan  1999      cases        745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000      cases       2666
##  4 Afghanistan  2000 population   20595360
##  5      Brazil  1999      cases      37737
##  6      Brazil  1999 population  172006362
##  7      Brazil  2000      cases      80488
##  8      Brazil  2000 population  174504898
##  9       China  1999      cases     212258
## 10       China  1999 population 1272915272
## 11       China  2000      cases     213766
## 12       China  2000 population 1280428583
tidy_spreading_data <- spread(spreading_data, key = Key, value = Value)
tidy_spreading_data
## # A tibble: 6 x 4
##       Country  Year  cases population
##         <chr> <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583

spread() returns a copy of your data set that has had the key and value columns removed. In their place, spread() adds a new column for each unique value of the key column. These unique values will form the column names of the new columns. spread() distributes the cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.