Overview

Today we will cover:

1. Cleaning your data (janitor or other standard naming practices)

2. dplyr (part of the tidyverse)

3. tidyr

4. Merging Data Sets

5. Piping

6. HACK! (Separate File)

(Setting up)

# Clear workspace
rm(list = ls()) 

# Set working directory
setwd("~/Desktop/Data Science/W3_Wrangling")

# Knit options
chooseCRANmirror(graphics=FALSE, ind=1)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)

# Install and load required packages. You can install the complete tidyverse with a single line of code:

#install.packages("tidyverse")

(when installing tidyverse, note the conflicts message. This tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag().

#install.packages("Lahman")  #This is Lahman's baseball dataset
#install.packages("janitor")
#install.packages("readxl")

Clean your data

It is helpful standardize names for all your variables, and remove any empty rows and columns. You can do this using the “janitor” library. This is especially useful when merging datasets that contain similar variables!

Most useful janitor commands for all datasets:

  • clean_names() - standardizes all names according to snake_case

  • remove_empty_rows()

  • remove_empty_cols()

Let’s try it:

library(readxl)
netflix <- read_excel("Netflix.xlsx") #import data set
netflix
## # A tibble: 11 x 4
##    `Hours of Netflix`  Pets   GPA `empty column`
##                 <dbl> <dbl> <dbl>          <lgl>
##  1                  5     1  3.00             NA
##  2                  8     3  2.50             NA
##  3                 12     0  2.00             NA
##  4                  4     2  3.50             NA
##  5                  1     1  3.80             NA
##  6                  3     1  3.80             NA
##  7                  3     0  4.00             NA
##  8                 NA    NA    NA             NA
##  9                 NA    NA    NA             NA
## 10                  4     0  2.50             NA
## 11                  4     1  3.25             NA
library(janitor)
clean_netflix <- clean_names(netflix) 
clean_netflix <- remove_empty_rows(clean_netflix)
clean_netflix <- remove_empty_cols(clean_netflix)

clean_netflix
## # A tibble: 9 x 3
##   hours_of_netflix  pets   gpa
##              <dbl> <dbl> <dbl>
## 1                5     1  3.00
## 2                8     3  2.50
## 3               12     0  2.00
## 4                4     2  3.50
## 5                1     1  3.80
## 6                3     1  3.80
## 7                3     0  4.00
## 8                4     0  2.50
## 9                4     1  3.25

You can also create a new data frame with ALL missing values removed with: newdata <- na.omit(mydata)

dplyr

There are five key dplyr functions that allow you to solve the vast majority of data manipulation challenges:

I want to… Function
1. Pick observations by their values filter()
2. Pick variables by their names select()
3. Reorder rows arrange()
4. Create new variables from existing variables mutate()
5. Collapse many values down to a single summary summarise()

These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group

For all of these functions:

  1. The first argument is a data frame

  2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

  3. The result is a new data frame.

1. Filter

filter()allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. For example, let’s pull out just the most recent batting data (2016), then just look at the Baltimore Orioles:

library(Lahman)
library(tidyverse)
## ── Attaching packages ─── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.3.4     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## ── Conflicts ────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
data <- filter(Batting, yearID == 2016)
head(data)
##    playerID yearID stint teamID lgID   G  AB  R   H X2B X3B HR RBI SB CS
## 1  abadfe01   2016     1    MIN   AL  39   1  0   0   0   0  0   0  0  0
## 2  abadfe01   2016     2    BOS   AL  18   0  0   0   0   0  0   0  0  0
## 3 abreujo02   2016     1    CHA   AL 159 624 67 183  32   1 25 100  0  2
## 4 achteaj01   2016     1    LAA   AL  27   0  0   0   0   0  0   0  0  0
## 5 ackledu01   2016     1    NYA   AL  28  61  6   9   0   0  0   4  0  0
## 6 adamecr01   2016     1    COL   NL 121 225 25  49   7   3  2  17  2  3
##   BB  SO IBB HBP SH SF GIDP
## 1  0   1   0   0  0  0    0
## 2  0   0   0   0  0  0    0
## 3 47 125   7  15  0  9   21
## 4  0   0   0   0  0  0    0
## 5  8   9   0   0  0  1    0
## 6 24  47   0   4  3  0    5
os_data <- filter(data, teamID == "BAL")

To use filtering effectively, you have to know how to select the observations that you want using the comparison operators. R provides the standard suite: >, >=, <, <=, != (not equal), and == (equal).

Symbols Meaning
‘>’ Greater than
‘>=’ Greater than or equal
‘<’ Less than
‘<=’ Less than or equal
‘!=’ Not equal
‘=’ Equal
‘&’ and
‘|’ or
‘!’ not

So, you can get rid of all the pitchers and players who were only briefly on the team by removing everyone with a low number of ABs:

os_data <- filter(os_data, !(AB == 0)) #Remove anyone who's AB = 0
os_data <- filter(os_data, !(AB < 100)) #Remove anyone who's AB < 100

Figure 5.1 in Wickham’s online book shows the complete set of Boolean operations.

Logical.operators

Logical.operators

2. Select

Now we want to narrow down the data to just the variables we are interested in. There are several ways of doing this:

# Select columns by name - new data frame includes only the named variables
hr <- select(os_data, playerID, AB, HR)

# Select all columns between  (inclusive)
playerID_to_RBI <- select(os_data, playerID:RBI)

# Select all columns except those from year to "lgID" (inclusive)
all_except <- select(os_data, -(yearID:lgID))

There are a number of helper functions you can use within select():

  • starts_with(“abc”): matches names that begin with “abc”.

  • ends_with(“xyz”): matches names that end with “xyz”.

  • contains(“ijk”): matches names that contain “ijk”.

See ?select for more details.

3. Arrange

arrange() simply re-arranges the order of rows according to a certain column. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.

Let’s go back to the data that includes just 3 variables - player ID, AB, and HR. We named this data frame “hr”

#Re-arragne hr data set so rows go from lowest to highest hit rate. This is the default.
hr_arranged <- arrange(hr, HR, AB)

#Use desc() to re-order by a column in descending order:
hr_desc <- arrange(hr, desc(HR))
hr_desc
##     playerID  AB HR
## 1  trumbma01 613 47
## 2  davisch02 566 38
## 3  machama01 640 37
## 4  jonesad01 619 29
## 5  schoojo01 615 25
## 6  alvarpe01 337 22
## 7  wietema01 423 17
## 8  hardyjj01 405  9
## 9    kimhy01 305  6
## 10 reimono01 203  6
## 11 rickajo01 257  5
## 12 flahery01 157  3
## 13 josepca01 132  0

4. Mutate

Mutate ads new columns as a function of current variables. Let’s get our own estimate of batting average (hit rate / at bats) and add it to our data frame.

hr_desc <- mutate(hr_desc, BA = HR/AB)
hr_desc
##     playerID  AB HR         BA
## 1  trumbma01 613 47 0.07667210
## 2  davisch02 566 38 0.06713781
## 3  machama01 640 37 0.05781250
## 4  jonesad01 619 29 0.04684976
## 5  schoojo01 615 25 0.04065041
## 6  alvarpe01 337 22 0.06528190
## 7  wietema01 423 17 0.04018913
## 8  hardyjj01 405  9 0.02222222
## 9    kimhy01 305  6 0.01967213
## 10 reimono01 203  6 0.02955665
## 11 rickajo01 257  5 0.01945525
## 12 flahery01 157  3 0.01910828
## 13 josepca01 132  0 0.00000000
#If you only want to keep the newly created variable, use transmute.
hr_desc <- transmute(hr_desc, playerID = playerID, BA = HR/AB)
hr_desc
##     playerID         BA
## 1  trumbma01 0.07667210
## 2  davisch02 0.06713781
## 3  machama01 0.05781250
## 4  jonesad01 0.04684976
## 5  schoojo01 0.04065041
## 6  alvarpe01 0.06528190
## 7  wietema01 0.04018913
## 8  hardyjj01 0.02222222
## 9    kimhy01 0.01967213
## 10 reimono01 0.02955665
## 11 rickajo01 0.01945525
## 12 flahery01 0.01910828
## 13 josepca01 0.00000000

In mutate, you can use basic arithmatic, or use functions such as: * sum() x + sum(x) * mean() y - mean(y) * log() or log10()

5. Summarize

The last key verb is summarise(). It collapses a data frame to a single row. It’s not really useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”.

Useful summarize functions: * mean() * median() * sum() * sd() * min() or max()

Adding na.rm removes any missing values prior to computation

#Let's go back to our original 2016 Batting data. We named this "data". 

#This is saying "Just give me the sum of hit rates per team":
by_teams <- group_by(data, teamID)
sum_hr_by_team <- summarise(by_teams, HR = sum(HR))

#Now, how about the average run's per team:
avgr_by_team <- summarise(by_teams, AvgR = mean(R))

#If you need to remove grouping, and return to operations on ungrouped data, use ungroup().

tidyr

When your data is tidy:

  1. Each column is a variable
  2. Each row is an observation
  3. Each value has it’s own cell

Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.

(when installing tidyverse, note the conflicts message. This tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag().

There are two main advantages:

  1. Generally, if you have one consistent data structure, it’s easy to work with several data sets.

  2. Specifically, most built-in R functions work with vectors of values. All columns become vectors of values, which makes it easier to put our variables into functions.

dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data.

Spreading and Gathering

For most real data, you’ll need to do some tidying. The first step is always to figure out what the variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data. The second step is to resolve one of two common problems:

  1. One variable might be spread across multiple columns.

  2. One observation might be scattered across multiple rows.

To fix these problems, you’ll need the two most important functions in tidyr: gather() and spread().

In the following examples, we will use UN Migration Data from 2015. It comes in an “untidy” excel file.

The file contains a bunch of worksheets to include different years and data broken down by total / male / female. But lets just look at ‘Table 6’ which contains the total migrants data for 2015 for this post. We need to skip the first 15 rows, which is just the title, then the next 8 rows which contains “region” info (we will just look at country-country migrations). Here is the code for that:

library(readxl)
UN_MigrantStockByOriginAndDestination_2015 <- read_excel("UN_MigrantStockByOriginAndDestination_2015.xlsx", 
    sheet = "Table 16", skip = 15)
#View(UN_MigrantStockByOriginAndDestination_2015)

I’m a fan of drop-downs, so if you would like to use the dropdown to import this data, use the following:

File -> Import Dataset

  • Select ‘Table 16’ from Sheet dropdown

  • Enter “15” in the “Skip” field.

#Notice that the first 8 rows are just regions. Conveniently, column X__5 is empty for these rows, so if we just want country-country, we can remove these rows with:
UN_filter <- filter(UN_MigrantStockByOriginAndDestination_2015, !is.na(X__5))

#Remove unnecessary columns. We just want migrants from countries to countries. Use ‘-’ to delete columns and using ‘start_with’ function inside ‘select’ command to delete multiple columns whose names matche the text pattern of “Other”.
UN_select <- select(UN_filter, -X__1, -X__3, -X__5, -Total, -starts_with("Other"))

#rename columns
UN_renamed <- rename(UN_select, destination_country = X__2, country_code = X__4)

head(UN_renamed)
## # A tibble: 6 x 234
##   destination_country country_code Afghanistan Albania Algeria
##                 <chr>        <dbl>       <dbl>   <dbl>   <dbl>
## 1             Burundi          108          NA      NA      NA
## 2             Comoros          174          NA      NA      NA
## 3            Djibouti          262          NA      NA      NA
## 4             Eritrea          232          NA      NA      NA
## 5            Ethiopia          231          NA      NA      NA
## 6               Kenya          404          NA      NA      NA
## # ... with 229 more variables: `American Samoa` <dbl>, Andorra <dbl>,
## #   Angola <dbl>, Anguilla <dbl>, `Antigua and Barbuda` <dbl>,
## #   Argentina <dbl>, Armenia <dbl>, Aruba <dbl>, Australia <dbl>,
## #   Austria <dbl>, Azerbaijan <dbl>, Bahamas <dbl>, Bahrain <dbl>,
## #   Bangladesh <dbl>, Barbados <dbl>, Belarus <dbl>, Belgium <dbl>,
## #   Belize <dbl>, Benin <dbl>, Bermuda <dbl>, Bhutan <dbl>, `Bolivia
## #   (Plurinational State of)` <dbl>, `Bonaire, Sint Eustatius and
## #   Saba` <dbl>, `Bosnia and Herzegovina` <dbl>, Botswana <dbl>,
## #   Brazil <dbl>, `British Virgin Islands` <dbl>, `Brunei
## #   Darussalam` <dbl>, Bulgaria <dbl>, `Burkina Faso` <dbl>,
## #   Burundi <dbl>, `Cabo Verde` <dbl>, Cambodia <dbl>, Cameroon <dbl>,
## #   Canada <dbl>, `Cayman Islands` <dbl>, `Central African
## #   Republic` <dbl>, Chad <dbl>, `Channel Islands` <dbl>, Chile <dbl>,
## #   China <dbl>, `China, Hong Kong Special Administrative Region` <dbl>,
## #   `China, Macao Special Administrative Region` <dbl>, Colombia <dbl>,
## #   Comoros <dbl>, Congo <dbl>, `Cook Islands` <dbl>, `Costa Rica` <dbl>,
## #   `Côte d'Ivoire` <dbl>, Croatia <dbl>, Cuba <dbl>, Curaçao <dbl>,
## #   Cyprus <dbl>, `Czech Republic` <dbl>, `Democratic People's Republic of
## #   Korea` <dbl>, `Democratic Republic of the Congo` <dbl>, Denmark <dbl>,
## #   Djibouti <dbl>, Dominica <dbl>, `Dominican Republic` <dbl>,
## #   Ecuador <dbl>, Egypt <dbl>, `El Salvador` <dbl>, `Equatorial
## #   Guinea` <dbl>, Eritrea <dbl>, Estonia <dbl>, Ethiopia <dbl>, `Faeroe
## #   Islands` <dbl>, `Falkland Islands (Malvinas)` <dbl>, Fiji <dbl>,
## #   Finland <dbl>, France <dbl>, `French Guiana` <dbl>, `French
## #   Polynesia` <dbl>, Gabon <dbl>, Gambia <dbl>, Georgia <dbl>,
## #   Germany <dbl>, Ghana <dbl>, Gibraltar <dbl>, Greece <dbl>,
## #   Greenland <dbl>, Grenada <dbl>, Guadeloupe <dbl>, Guam <dbl>,
## #   Guatemala <dbl>, Guinea <dbl>, `Guinea-Bissau` <dbl>, Guyana <dbl>,
## #   Haiti <dbl>, `Holy See` <dbl>, Honduras <dbl>, Hungary <dbl>,
## #   Iceland <dbl>, India <dbl>, Indonesia <dbl>, `Iran (Islamic Republic
## #   of)` <dbl>, Iraq <dbl>, Ireland <dbl>, `Isle of Man` <dbl>, ...

Gather

Now, we have origin countries listed separate columns. If we wanted a single “origin country” variable, we need to gather these columns into a new pair of variables. To describe that operation we need three parameters:

  1. The set of columns that represent values, not variables. In this example, those are the columns Afghanistan:Zimbabwe.

  2. The key = the name of the variable whose values form the column names. In this exaple, let’s call this variable “origin countries”.

  3. The value = The name of the variable whose values are spread over the cells. Here, it is number of migrants.

In gather() you can select columns the same way you would with select() command (e.g.,starts_with(), contains() )

In the final result, the gathered columns are dropped, and we get new key and value columns. Here, we can see the estimated number of people who migrated for each pair of the countries.

#Note - there are many pairs of countries people didn't migrate from or to, so we can get rid of those by inserting "na.rm=True" at the end:
UN_tidy <- gather(UN_renamed, origin_country, migrants, Afghanistan:Zimbabwe, na.rm=TRUE)
head(UN_tidy)
## # A tibble: 6 x 4
##   destination_country country_code origin_country migrants
##                 <chr>        <dbl>          <chr>    <dbl>
## 1               Egypt          818    Afghanistan      235
## 2               Libya          434    Afghanistan      320
## 3             Namibia          516    Afghanistan       39
## 4        South Africa          710    Afghanistan       83
## 5          Tajikistan          762    Afghanistan     7587
## 6            Malaysia          458    Afghanistan      498

Spread

Spreading is the opposite of gathering. You use it when an observation is scattered across multiple rows.

For spread() you need 2 parameters:

  1. The key = column that contains variable names. For us, this is conveniently called “Key”

  2. The value = The name of the variable whose values are spread over the cells. Here, it “Value”.

#read in the data
library(readxl)
spreading_data <- read_excel("spreading_data.xlsx")
spreading_data
## # A tibble: 12 x 4
##        Country  Year        Key      Value
##          <chr> <dbl>      <chr>      <dbl>
##  1 Afghanistan  1999      cases        745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000      cases       2666
##  4 Afghanistan  2000 population   20595360
##  5      Brazil  1999      cases      37737
##  6      Brazil  1999 population  172006362
##  7      Brazil  2000      cases      80488
##  8      Brazil  2000 population  174504898
##  9       China  1999      cases     212258
## 10       China  1999 population 1272915272
## 11       China  2000      cases     213766
## 12       China  2000 population 1280428583
tidy_spreading_data <- spread(spreading_data, key = Key, value = Value)
tidy_spreading_data
## # A tibble: 6 x 4
##       Country  Year  cases population
##         <chr> <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583

spread() returns a copy of your data set that has had the key and value columns removed. In their place, spread() adds a new column for each unique value of the key column. These unique values will form the column names of the new columns. spread() distributes the cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.

spreading

spreading

spread() and gather() are complements.

  • gather() makes wide tables narrower and longer;

  • spread() makes long tables shorter and wider.

Separating and Uniting

Separate

separate() pulls apart one column into multiple columns, by splitting wherever a separator character appears. This is helpful for time data. In this example, let’s use the “year” and separate this value into “century” then “year”:

#Separate the Year variable into 2 columns: "century" and "year". Take the last 2 digits to separate year.

separate_data <- separate(tidy_spreading_data, Year, into = c("century", "year"), sep = 2)
separate_data
## # A tibble: 6 x 5
##       Country century  year  cases population
##         <chr>   <chr> <chr>  <dbl>      <dbl>
## 1 Afghanistan      19    99    745   19987071
## 2 Afghanistan      20    00   2666   20595360
## 3      Brazil      19    99  37737  172006362
## 4      Brazil      20    00  80488  174504898
## 5       China      19    99 212258 1272915272
## 6       China      20    00 213766 1280428583

Unite

unite() is the inverse of separate(): it combines multiple columns into a single column.

Let’s undo what we just did:

unite_data <- unite(separate_data, Year, century, year)
unite_data
## # A tibble: 6 x 4
##       Country  Year  cases population
##         <chr> <chr>  <dbl>      <dbl>
## 1 Afghanistan 19_99    745   19987071
## 2 Afghanistan 20_00   2666   20595360
## 3      Brazil 19_99  37737  172006362
## 4      Brazil 20_00  80488  174504898
## 5       China 19_99 212258 1272915272
## 6       China 20_00 213766 1280428583
#In this case we also need to use the "sep" argument. The default will place an underscore (_) between the values from different columns. Here we don’t want any separator so we use "":
unite_data_clean <- unite(separate_data, Year, century, year, sep = "")
unite_data_clean
## # A tibble: 6 x 4
##       Country  Year  cases population
##         <chr> <chr>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583

Missing Values

So far, we’ve just used na.rm = TRUE to remove any missing values (“NA’s”) in our data sets. But there are other helpful commands to use as well:

Complete()

Fill()

Let’s create some simple data, and try each one:

#Making a simple data set for this example

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66),
)
stocks
## # A tibble: 7 x 3
##    year   qtr return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2015     3   0.35
## 4  2015     4     NA
## 5  2016     2   0.92
## 6  2016     3   0.17
## 7  2016     4   2.66
  • The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains NA.

  • The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.

Complete() takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit NAs where necessary.

stocks %>% 
  complete(year, qtr)
## # A tibble: 8 x 3
##    year   qtr return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2015     3   0.35
## 4  2015     4     NA
## 5  2016     1     NA
## 6  2016     2   0.92
## 7  2016     3   0.17
## 8  2016     4   2.66
#Now we can see another "NA" for 2016, 1st quarter.

There’s one other important tool that you should know for working with missing values. Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:

treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)
treatment
## # A tibble: 4 x 3
##             person treatment response
##              <chr>     <dbl>    <dbl>
## 1 Derrick Whitmore         1        7
## 2             <NA>         2       10
## 3             <NA>         3        9
## 4  Katherine Burke         1        4

You can fill in these missing values with fill(). It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).

treatment %>% 
  fill(person)
## # A tibble: 4 x 3
##             person treatment response
##              <chr>     <dbl>    <dbl>
## 1 Derrick Whitmore         1        7
## 2 Derrick Whitmore         2       10
## 3 Derrick Whitmore         3        9
## 4  Katherine Burke         1        4
#Now the name from the previous row has filled the "NA's" for that column

Merging Data Sets

Suppose you have two data files, dataset1 and dataset2, that need to be merged into a single data set. First, read both data files in R. Then, use the merge() function to join the two data sets based on a unique id variable that is common to both data sets:

#Read in both data stets:
library(readxl)
dataset1 <- read_excel("dataset1.xlsx")
dataset1
## # A tibble: 13 x 6
##          Country CountryID        Region RegionID Population GDP_trill
##            <chr>     <dbl>         <chr>    <dbl>      <dbl>     <dbl>
##  1 United States       401 North America        1  285202936     18.57
##  2        Brazil       468 South America        2  934756203      2.50
##  3        Russia       274        Europe        3   93476732     12.70
##  4         Kenya       246        Africa        4   62450956      0.54
##  5        Canada       645 North America        1   68972038     19.86
##  6     Australia       797        Oceana        5   39457402     20.52
##  7         Japan       264          Asia        6  573822058     16.64
##  8         China       824          Asia        6   45063893     19.46
##  9      Thailand       724          Asia        6  343038468      2.54
## 10   New Zealand       869        Oceana        5    9348758     12.22
## 11         India       245          Asia        6  934957932      8.72
## 12        France       629        Europe        3     836585     15.88
## 13       Iceland       164        Europe        3    9298435     12.43
dataset2 <- read_excel("dataset2.xlsx")
dataset2
## # A tibble: 9 x 5
##         Country StateID        Region RegionID `Main sport`
##           <chr>   <dbl>         <chr>    <dbl>        <chr>
## 1 United States     401 North America        1     Football
## 2        Brazil     468 South America        2       Soccer
## 3        Russia     274        Europe        3       Soccer
## 4         Kenya     246        Africa        4       Soccer
## 5        Canada     645 North America        1   Ice Hockey
## 6     Australia     797        Oceana        5      Cricket
## 7         Japan     264          Asia        6     Baseball
## 8         China     824          Asia        6   Basketball
## 9      Thailand     724          Asia        6       Soccer
#Merge:
merged.data1 <- merge(dataset1, dataset2, by="Country")

merged.data is now an R object, which contains the two merged data sets. The data files were joined based on the id variable “countryID.”

It is possible to merge data files by more than one id variable:

merged.data2 <- merge(dataset1, dataset2, by=c("Country", "RegionID", "Region"))

It is also possible to merge the two files if the unique id variable has a different name in each data set. For example, the id variable may be called “countryID” in dataset1, but called “stateID” in dataset2:

#In this case, by.x calls the name of the id variable in dataset1, and by.y calls the name of the id variable in dataset2.
merged.data3 <- merge(dataset1, dataset2, by.x="CountryID", by.y="StateID")


#Note: The default setting of the merge() function drops all unmatched cases. If you want to keep all cases in the new data set, include the option all=TRUE in the merge() function:
merged.data.all <- merge(dataset1, dataset2, by="Country", all=TRUE)


#To keep unmatched cases only from dataset1, use the all.x option. 
merged.data.all <- merge(dataset1, dataset2, by="Country", all.x=TRUE)

#When all.x=TRUE, an extra row will be added to the output for each case in dataset1 that has no matching cases in dataset2. Cases that do not have values from dataset2 will be labeled as missing. 

#Similarly, to keep unmatched cases only from dataset2, use the all.y option:
merged.data.all <- merge(dataset1, dataset2, by="Country", all.y=TRUE)

#Again, when all.y=TRUE, an extra row will be added to the output for each case in dataset2 that has no matching cases in dataset1. Cases that do not have values from dataset1 will be labeled as missing.

Piping in R

From Chapter 18 R for Data Science (Grolemund, Wickham)

Pipes are used to combine a sequence of multiple operations

The pipe, %>%, comes from the magrittr package by Stefan Milton Bache, but is loaded with the tidyverse

Much of the time it is used to make the sequence easier to read

library(Lahman)
data <- Batting

new_data <- select(data, AB, H, HR)

#or with pipe
new_data2 <- data %>% select(AB, H, HR)

head(new_data == new_data2)
##        AB    H   HR
## [1,] TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE
## [6,] TRUE TRUE TRUE
#we can continue to add operations
new_data2 <- data %>% select(AB, H, HR) %>% mutate(AVG = H/AB)

head(new_data2)
##    AB  H HR       AVG
## 1   4  0  0 0.0000000
## 2 118 32  0 0.2711864
## 3 137 40  0 0.2919708
## 4 133 44  2 0.3308271
## 5 120 39  0 0.3250000
## 6  49 11  0 0.2244898

The authors warn to use pipes only when they are the best tool.

Do not use pipes when:

  1. You have more than 10 operations. This becomes confusing and difficult to debug
  2. Multiple inputs or outputs–piping is linear

~~ # References