Today we will cover:
1. Cleaning your data (janitor or other standard naming practices)
2. dplyr (part of the tidyverse)
3. tidyr
4. Merging Data Sets
5. Piping
6. HACK! (Separate File)
# Clear workspace
rm(list = ls())
# Set working directory
setwd("~/Desktop/Data Science/W3_Wrangling")
# Knit options
chooseCRANmirror(graphics=FALSE, ind=1)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
# Install and load required packages. You can install the complete tidyverse with a single line of code:
#install.packages("tidyverse")
(when installing tidyverse, note the conflicts message. This tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter()
and stats::lag()
.
#install.packages("Lahman") #This is Lahman's baseball dataset
#install.packages("janitor")
#install.packages("readxl")
It is helpful standardize names for all your variables, and remove any empty rows and columns. You can do this using the “janitor” library. This is especially useful when merging datasets that contain similar variables!
clean_names() - standardizes all names according to snake_case
remove_empty_rows()
remove_empty_cols()
Let’s try it:
library(readxl)
netflix <- read_excel("Netflix.xlsx") #import data set
netflix
## # A tibble: 11 x 4
## `Hours of Netflix` Pets GPA `empty column`
## <dbl> <dbl> <dbl> <lgl>
## 1 5 1 3.00 NA
## 2 8 3 2.50 NA
## 3 12 0 2.00 NA
## 4 4 2 3.50 NA
## 5 1 1 3.80 NA
## 6 3 1 3.80 NA
## 7 3 0 4.00 NA
## 8 NA NA NA NA
## 9 NA NA NA NA
## 10 4 0 2.50 NA
## 11 4 1 3.25 NA
library(janitor)
clean_netflix <- clean_names(netflix)
clean_netflix <- remove_empty_rows(clean_netflix)
clean_netflix <- remove_empty_cols(clean_netflix)
clean_netflix
## # A tibble: 9 x 3
## hours_of_netflix pets gpa
## <dbl> <dbl> <dbl>
## 1 5 1 3.00
## 2 8 3 2.50
## 3 12 0 2.00
## 4 4 2 3.50
## 5 1 1 3.80
## 6 3 1 3.80
## 7 3 0 4.00
## 8 4 0 2.50
## 9 4 1 3.25
You can also create a new data frame with ALL missing values removed with: newdata <- na.omit(mydata)
There are five key dplyr functions that allow you to solve the vast majority of data manipulation challenges:
I want to… | Function |
---|---|
1. Pick observations by their values | filter() |
2. Pick variables by their names | select() |
3. Reorder rows | arrange() |
4. Create new variables from existing variables | mutate() |
5. Collapse many values down to a single summary | summarise() |
These can all be used in conjunction with group_by()
which changes the scope of each function from operating on the entire dataset to operating on it group-by-group
For all of these functions:
The first argument is a data frame
The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
The result is a new data frame.
filter()allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. For example, let’s pull out just the most recent batting data (2016), then just look at the Baltimore Orioles:
library(Lahman)
library(tidyverse)
## ── Attaching packages ─── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.3.4 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
data <- filter(Batting, yearID == 2016)
head(data)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS
## 1 abadfe01 2016 1 MIN AL 39 1 0 0 0 0 0 0 0 0
## 2 abadfe01 2016 2 BOS AL 18 0 0 0 0 0 0 0 0 0
## 3 abreujo02 2016 1 CHA AL 159 624 67 183 32 1 25 100 0 2
## 4 achteaj01 2016 1 LAA AL 27 0 0 0 0 0 0 0 0 0
## 5 ackledu01 2016 1 NYA AL 28 61 6 9 0 0 0 4 0 0
## 6 adamecr01 2016 1 COL NL 121 225 25 49 7 3 2 17 2 3
## BB SO IBB HBP SH SF GIDP
## 1 0 1 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## 3 47 125 7 15 0 9 21
## 4 0 0 0 0 0 0 0
## 5 8 9 0 0 0 1 0
## 6 24 47 0 4 3 0 5
os_data <- filter(data, teamID == "BAL")
To use filtering effectively, you have to know how to select the observations that you want using the comparison operators. R provides the standard suite: >, >=, <, <=, != (not equal), and == (equal).
Symbols | Meaning |
---|---|
‘>’ | Greater than |
‘>=’ | Greater than or equal |
‘<’ | Less than |
‘<=’ | Less than or equal |
‘!=’ | Not equal |
‘=’ | Equal |
‘&’ | and |
‘|’ | or |
‘!’ | not |
So, you can get rid of all the pitchers and players who were only briefly on the team by removing everyone with a low number of ABs:
os_data <- filter(os_data, !(AB == 0)) #Remove anyone who's AB = 0
os_data <- filter(os_data, !(AB < 100)) #Remove anyone who's AB < 100
Figure 5.1 in Wickham’s online book shows the complete set of Boolean operations.
Now we want to narrow down the data to just the variables we are interested in. There are several ways of doing this:
# Select columns by name - new data frame includes only the named variables
hr <- select(os_data, playerID, AB, HR)
# Select all columns between (inclusive)
playerID_to_RBI <- select(os_data, playerID:RBI)
# Select all columns except those from year to "lgID" (inclusive)
all_except <- select(os_data, -(yearID:lgID))
There are a number of helper functions you can use within select():
starts_with(“abc”): matches names that begin with “abc”.
ends_with(“xyz”): matches names that end with “xyz”.
contains(“ijk”): matches names that contain “ijk”.
See ?select
for more details.
arrange()
simply re-arranges the order of rows according to a certain column. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
Let’s go back to the data that includes just 3 variables - player ID, AB, and HR. We named this data frame “hr”
#Re-arragne hr data set so rows go from lowest to highest hit rate. This is the default.
hr_arranged <- arrange(hr, HR, AB)
#Use desc() to re-order by a column in descending order:
hr_desc <- arrange(hr, desc(HR))
hr_desc
## playerID AB HR
## 1 trumbma01 613 47
## 2 davisch02 566 38
## 3 machama01 640 37
## 4 jonesad01 619 29
## 5 schoojo01 615 25
## 6 alvarpe01 337 22
## 7 wietema01 423 17
## 8 hardyjj01 405 9
## 9 kimhy01 305 6
## 10 reimono01 203 6
## 11 rickajo01 257 5
## 12 flahery01 157 3
## 13 josepca01 132 0
Mutate ads new columns as a function of current variables. Let’s get our own estimate of batting average (hit rate / at bats) and add it to our data frame.
hr_desc <- mutate(hr_desc, BA = HR/AB)
hr_desc
## playerID AB HR BA
## 1 trumbma01 613 47 0.07667210
## 2 davisch02 566 38 0.06713781
## 3 machama01 640 37 0.05781250
## 4 jonesad01 619 29 0.04684976
## 5 schoojo01 615 25 0.04065041
## 6 alvarpe01 337 22 0.06528190
## 7 wietema01 423 17 0.04018913
## 8 hardyjj01 405 9 0.02222222
## 9 kimhy01 305 6 0.01967213
## 10 reimono01 203 6 0.02955665
## 11 rickajo01 257 5 0.01945525
## 12 flahery01 157 3 0.01910828
## 13 josepca01 132 0 0.00000000
#If you only want to keep the newly created variable, use transmute.
hr_desc <- transmute(hr_desc, playerID = playerID, BA = HR/AB)
hr_desc
## playerID BA
## 1 trumbma01 0.07667210
## 2 davisch02 0.06713781
## 3 machama01 0.05781250
## 4 jonesad01 0.04684976
## 5 schoojo01 0.04065041
## 6 alvarpe01 0.06528190
## 7 wietema01 0.04018913
## 8 hardyjj01 0.02222222
## 9 kimhy01 0.01967213
## 10 reimono01 0.02955665
## 11 rickajo01 0.01945525
## 12 flahery01 0.01910828
## 13 josepca01 0.00000000
In mutate, you can use basic arithmatic, or use functions such as: * sum()
x + sum(x) * mean()
y - mean(y) * log()
or log10()
The last key verb is summarise()
. It collapses a data frame to a single row. It’s not really useful unless we pair it with group_by()
. This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”.
Useful summarize functions: * mean() * median() * sum() * sd() * min() or max()
Adding na.rm
removes any missing values prior to computation
#Let's go back to our original 2016 Batting data. We named this "data".
#This is saying "Just give me the sum of hit rates per team":
by_teams <- group_by(data, teamID)
sum_hr_by_team <- summarise(by_teams, HR = sum(HR))
#Now, how about the average run's per team:
avgr_by_team <- summarise(by_teams, AvgR = mean(R))
#If you need to remove grouping, and return to operations on ungrouped data, use ungroup().
When your data is tidy:
Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.
(when installing tidyverse, note the conflicts message. This tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter()
and stats::lag()
.
There are two main advantages:
Generally, if you have one consistent data structure, it’s easy to work with several data sets.
Specifically, most built-in R functions work with vectors of values. All columns become vectors of values, which makes it easier to put our variables into functions.
dplyr
, ggplot2
, and all the other packages in the tidyverse are designed to work with tidy data.
For most real data, you’ll need to do some tidying. The first step is always to figure out what the variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data. The second step is to resolve one of two common problems:
One variable might be spread across multiple columns.
One observation might be scattered across multiple rows.
To fix these problems, you’ll need the two most important functions in tidyr: gather()
and spread()
.
In the following examples, we will use UN Migration Data from 2015. It comes in an “untidy” excel file.
The file contains a bunch of worksheets to include different years and data broken down by total / male / female. But lets just look at ‘Table 6’ which contains the total migrants data for 2015 for this post. We need to skip the first 15 rows, which is just the title, then the next 8 rows which contains “region” info (we will just look at country-country migrations). Here is the code for that:
library(readxl)
UN_MigrantStockByOriginAndDestination_2015 <- read_excel("UN_MigrantStockByOriginAndDestination_2015.xlsx",
sheet = "Table 16", skip = 15)
#View(UN_MigrantStockByOriginAndDestination_2015)
I’m a fan of drop-downs, so if you would like to use the dropdown to import this data, use the following:
File -> Import Dataset
Select ‘Table 16’ from Sheet dropdown
Enter “15” in the “Skip” field.
#Notice that the first 8 rows are just regions. Conveniently, column X__5 is empty for these rows, so if we just want country-country, we can remove these rows with:
UN_filter <- filter(UN_MigrantStockByOriginAndDestination_2015, !is.na(X__5))
#Remove unnecessary columns. We just want migrants from countries to countries. Use ‘-’ to delete columns and using ‘start_with’ function inside ‘select’ command to delete multiple columns whose names matche the text pattern of “Other”.
UN_select <- select(UN_filter, -X__1, -X__3, -X__5, -Total, -starts_with("Other"))
#rename columns
UN_renamed <- rename(UN_select, destination_country = X__2, country_code = X__4)
head(UN_renamed)
## # A tibble: 6 x 234
## destination_country country_code Afghanistan Albania Algeria
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Burundi 108 NA NA NA
## 2 Comoros 174 NA NA NA
## 3 Djibouti 262 NA NA NA
## 4 Eritrea 232 NA NA NA
## 5 Ethiopia 231 NA NA NA
## 6 Kenya 404 NA NA NA
## # ... with 229 more variables: `American Samoa` <dbl>, Andorra <dbl>,
## # Angola <dbl>, Anguilla <dbl>, `Antigua and Barbuda` <dbl>,
## # Argentina <dbl>, Armenia <dbl>, Aruba <dbl>, Australia <dbl>,
## # Austria <dbl>, Azerbaijan <dbl>, Bahamas <dbl>, Bahrain <dbl>,
## # Bangladesh <dbl>, Barbados <dbl>, Belarus <dbl>, Belgium <dbl>,
## # Belize <dbl>, Benin <dbl>, Bermuda <dbl>, Bhutan <dbl>, `Bolivia
## # (Plurinational State of)` <dbl>, `Bonaire, Sint Eustatius and
## # Saba` <dbl>, `Bosnia and Herzegovina` <dbl>, Botswana <dbl>,
## # Brazil <dbl>, `British Virgin Islands` <dbl>, `Brunei
## # Darussalam` <dbl>, Bulgaria <dbl>, `Burkina Faso` <dbl>,
## # Burundi <dbl>, `Cabo Verde` <dbl>, Cambodia <dbl>, Cameroon <dbl>,
## # Canada <dbl>, `Cayman Islands` <dbl>, `Central African
## # Republic` <dbl>, Chad <dbl>, `Channel Islands` <dbl>, Chile <dbl>,
## # China <dbl>, `China, Hong Kong Special Administrative Region` <dbl>,
## # `China, Macao Special Administrative Region` <dbl>, Colombia <dbl>,
## # Comoros <dbl>, Congo <dbl>, `Cook Islands` <dbl>, `Costa Rica` <dbl>,
## # `Côte d'Ivoire` <dbl>, Croatia <dbl>, Cuba <dbl>, Curaçao <dbl>,
## # Cyprus <dbl>, `Czech Republic` <dbl>, `Democratic People's Republic of
## # Korea` <dbl>, `Democratic Republic of the Congo` <dbl>, Denmark <dbl>,
## # Djibouti <dbl>, Dominica <dbl>, `Dominican Republic` <dbl>,
## # Ecuador <dbl>, Egypt <dbl>, `El Salvador` <dbl>, `Equatorial
## # Guinea` <dbl>, Eritrea <dbl>, Estonia <dbl>, Ethiopia <dbl>, `Faeroe
## # Islands` <dbl>, `Falkland Islands (Malvinas)` <dbl>, Fiji <dbl>,
## # Finland <dbl>, France <dbl>, `French Guiana` <dbl>, `French
## # Polynesia` <dbl>, Gabon <dbl>, Gambia <dbl>, Georgia <dbl>,
## # Germany <dbl>, Ghana <dbl>, Gibraltar <dbl>, Greece <dbl>,
## # Greenland <dbl>, Grenada <dbl>, Guadeloupe <dbl>, Guam <dbl>,
## # Guatemala <dbl>, Guinea <dbl>, `Guinea-Bissau` <dbl>, Guyana <dbl>,
## # Haiti <dbl>, `Holy See` <dbl>, Honduras <dbl>, Hungary <dbl>,
## # Iceland <dbl>, India <dbl>, Indonesia <dbl>, `Iran (Islamic Republic
## # of)` <dbl>, Iraq <dbl>, Ireland <dbl>, `Isle of Man` <dbl>, ...
Now, we have origin countries listed separate columns. If we wanted a single “origin country” variable, we need to gather these columns into a new pair of variables. To describe that operation we need three parameters:
The set of columns that represent values, not variables. In this example, those are the columns Afghanistan:Zimbabwe.
The key = the name of the variable whose values form the column names. In this exaple, let’s call this variable “origin countries”.
The value = The name of the variable whose values are spread over the cells. Here, it is number of migrants.
In gather()
you can select columns the same way you would with select()
command (e.g.,starts_with()
, contains()
)
In the final result, the gathered columns are dropped, and we get new key and value columns. Here, we can see the estimated number of people who migrated for each pair of the countries.
#Note - there are many pairs of countries people didn't migrate from or to, so we can get rid of those by inserting "na.rm=True" at the end:
UN_tidy <- gather(UN_renamed, origin_country, migrants, Afghanistan:Zimbabwe, na.rm=TRUE)
head(UN_tidy)
## # A tibble: 6 x 4
## destination_country country_code origin_country migrants
## <chr> <dbl> <chr> <dbl>
## 1 Egypt 818 Afghanistan 235
## 2 Libya 434 Afghanistan 320
## 3 Namibia 516 Afghanistan 39
## 4 South Africa 710 Afghanistan 83
## 5 Tajikistan 762 Afghanistan 7587
## 6 Malaysia 458 Afghanistan 498
Spreading is the opposite of gathering. You use it when an observation is scattered across multiple rows.
For spread()
you need 2 parameters:
The key = column that contains variable names. For us, this is conveniently called “Key”
The value = The name of the variable whose values are spread over the cells. Here, it “Value”.
#read in the data
library(readxl)
spreading_data <- read_excel("spreading_data.xlsx")
spreading_data
## # A tibble: 12 x 4
## Country Year Key Value
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
tidy_spreading_data <- spread(spreading_data, key = Key, value = Value)
tidy_spreading_data
## # A tibble: 6 x 4
## Country Year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
spread()
returns a copy of your data set that has had the key and value columns removed. In their place, spread()
adds a new column for each unique value of the key column. These unique values will form the column names of the new columns. spread()
distributes the cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.
spread()
and gather()
are complements.
gather()
makes wide tables narrower and longer;
spread()
makes long tables shorter and wider.
separate()
pulls apart one column into multiple columns, by splitting wherever a separator character appears. This is helpful for time data. In this example, let’s use the “year” and separate this value into “century” then “year”:
#Separate the Year variable into 2 columns: "century" and "year". Take the last 2 digits to separate year.
separate_data <- separate(tidy_spreading_data, Year, into = c("century", "year"), sep = 2)
separate_data
## # A tibble: 6 x 5
## Country century year cases population
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Afghanistan 19 99 745 19987071
## 2 Afghanistan 20 00 2666 20595360
## 3 Brazil 19 99 37737 172006362
## 4 Brazil 20 00 80488 174504898
## 5 China 19 99 212258 1272915272
## 6 China 20 00 213766 1280428583
unite()
is the inverse of separate(): it combines multiple columns into a single column.
Let’s undo what we just did:
unite_data <- unite(separate_data, Year, century, year)
unite_data
## # A tibble: 6 x 4
## Country Year cases population
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan 19_99 745 19987071
## 2 Afghanistan 20_00 2666 20595360
## 3 Brazil 19_99 37737 172006362
## 4 Brazil 20_00 80488 174504898
## 5 China 19_99 212258 1272915272
## 6 China 20_00 213766 1280428583
#In this case we also need to use the "sep" argument. The default will place an underscore (_) between the values from different columns. Here we don’t want any separator so we use "":
unite_data_clean <- unite(separate_data, Year, century, year, sep = "")
unite_data_clean
## # A tibble: 6 x 4
## Country Year cases population
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
So far, we’ve just used na.rm = TRUE
to remove any missing values (“NA’s”) in our data sets. But there are other helpful commands to use as well:
Complete()
Fill()
Let’s create some simple data, and try each one:
#Making a simple data set for this example
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66),
)
stocks
## # A tibble: 7 x 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 2 0.92
## 6 2016 3 0.17
## 7 2016 4 2.66
The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains NA.
The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.
Complete()
takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit NAs where necessary.
stocks %>%
complete(year, qtr)
## # A tibble: 8 x 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 1 NA
## 6 2016 2 0.92
## 7 2016 3 0.17
## 8 2016 4 2.66
#Now we can see another "NA" for 2016, 1st quarter.
There’s one other important tool that you should know for working with missing values. Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
treatment <- tribble(
~ person, ~ treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, 9,
"Katherine Burke", 1, 4
)
treatment
## # A tibble: 4 x 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Derrick Whitmore 1 7
## 2 <NA> 2 10
## 3 <NA> 3 9
## 4 Katherine Burke 1 4
You can fill in these missing values with fill()
. It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
treatment %>%
fill(person)
## # A tibble: 4 x 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Derrick Whitmore 1 7
## 2 Derrick Whitmore 2 10
## 3 Derrick Whitmore 3 9
## 4 Katherine Burke 1 4
#Now the name from the previous row has filled the "NA's" for that column
Suppose you have two data files, dataset1
and dataset2
, that need to be merged into a single data set. First, read both data files in R. Then, use the merge()
function to join the two data sets based on a unique id variable that is common to both data sets:
#Read in both data stets:
library(readxl)
dataset1 <- read_excel("dataset1.xlsx")
dataset1
## # A tibble: 13 x 6
## Country CountryID Region RegionID Population GDP_trill
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 United States 401 North America 1 285202936 18.57
## 2 Brazil 468 South America 2 934756203 2.50
## 3 Russia 274 Europe 3 93476732 12.70
## 4 Kenya 246 Africa 4 62450956 0.54
## 5 Canada 645 North America 1 68972038 19.86
## 6 Australia 797 Oceana 5 39457402 20.52
## 7 Japan 264 Asia 6 573822058 16.64
## 8 China 824 Asia 6 45063893 19.46
## 9 Thailand 724 Asia 6 343038468 2.54
## 10 New Zealand 869 Oceana 5 9348758 12.22
## 11 India 245 Asia 6 934957932 8.72
## 12 France 629 Europe 3 836585 15.88
## 13 Iceland 164 Europe 3 9298435 12.43
dataset2 <- read_excel("dataset2.xlsx")
dataset2
## # A tibble: 9 x 5
## Country StateID Region RegionID `Main sport`
## <chr> <dbl> <chr> <dbl> <chr>
## 1 United States 401 North America 1 Football
## 2 Brazil 468 South America 2 Soccer
## 3 Russia 274 Europe 3 Soccer
## 4 Kenya 246 Africa 4 Soccer
## 5 Canada 645 North America 1 Ice Hockey
## 6 Australia 797 Oceana 5 Cricket
## 7 Japan 264 Asia 6 Baseball
## 8 China 824 Asia 6 Basketball
## 9 Thailand 724 Asia 6 Soccer
#Merge:
merged.data1 <- merge(dataset1, dataset2, by="Country")
merged.data
is now an R object, which contains the two merged data sets. The data files were joined based on the id variable “countryID.”
It is possible to merge data files by more than one id variable:
merged.data2 <- merge(dataset1, dataset2, by=c("Country", "RegionID", "Region"))
It is also possible to merge the two files if the unique id variable has a different name in each data set. For example, the id variable may be called “countryID” in dataset1
, but called “stateID” in dataset2
:
#In this case, by.x calls the name of the id variable in dataset1, and by.y calls the name of the id variable in dataset2.
merged.data3 <- merge(dataset1, dataset2, by.x="CountryID", by.y="StateID")
#Note: The default setting of the merge() function drops all unmatched cases. If you want to keep all cases in the new data set, include the option all=TRUE in the merge() function:
merged.data.all <- merge(dataset1, dataset2, by="Country", all=TRUE)
#To keep unmatched cases only from dataset1, use the all.x option.
merged.data.all <- merge(dataset1, dataset2, by="Country", all.x=TRUE)
#When all.x=TRUE, an extra row will be added to the output for each case in dataset1 that has no matching cases in dataset2. Cases that do not have values from dataset2 will be labeled as missing.
#Similarly, to keep unmatched cases only from dataset2, use the all.y option:
merged.data.all <- merge(dataset1, dataset2, by="Country", all.y=TRUE)
#Again, when all.y=TRUE, an extra row will be added to the output for each case in dataset2 that has no matching cases in dataset1. Cases that do not have values from dataset1 will be labeled as missing.
The pipe, %>%
, comes from the magrittr package by Stefan Milton Bache, but is loaded with the tidyverse
library(Lahman)
data <- Batting
new_data <- select(data, AB, H, HR)
#or with pipe
new_data2 <- data %>% select(AB, H, HR)
head(new_data == new_data2)
## AB H HR
## [1,] TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE
## [6,] TRUE TRUE TRUE
#we can continue to add operations
new_data2 <- data %>% select(AB, H, HR) %>% mutate(AVG = H/AB)
head(new_data2)
## AB H HR AVG
## 1 4 0 0 0.0000000
## 2 118 32 0 0.2711864
## 3 137 40 0 0.2919708
## 4 133 44 2 0.3308271
## 5 120 39 0 0.3250000
## 6 49 11 0 0.2244898