This presentation aims to give a very brief introduction to the general manipulation of character strings and text data. We will cover string manipulation using the stringr package and some functions from base R, along with how regular expressions can be applied to pattern matching. We will also introduce text analysis examples using the tidytext package.
As we will barely scratch the surface of strings, regular expressions, and text analysis, a resources section is available at the end of the presentation.
Reference
Anderson, D. (2017). EDLD610 - Exploring Data with R
Even if you don’t frequently conduct text analysis or text mining, character string manipulation is useful for data wrangling:
You want to remove a given character in the names of your variables
You want to replace a given character in your data
You want to convert labels to upper case (or lower case)
You want to subset your data based on a specific pattern (e.g., extract phone numbers from a messy survey: 541-346-1234, 541.346.1234, 541-3461234, work: 541-3461234; see the sketch after this list)
and more!
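As a taste of the pattern-matching use case above, here is a minimal sketch that pulls the phone numbers out of those messy survey responses with a regular expression (regular expressions are covered later; the messy vector is a made-up illustration):
library(stringr)
messy <- c("541-346-1234", "541.346.1234", "541-3461234", "work: 541-3461234")
#3 digits, an optional - or . separator, 3 digits, another optional separator, 4 digits
str_extract(messy, "\\d{3}[-.]?\\d{3}[-.]?\\d{4}")
## [1] "541-346-1234" "541.346.1234" "541-3461234"  "541-3461234"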
We will cover the stringr package and some functions from base R. The stringr package is part of the tidyverse, but many of the base R functions are common enough to warrant some introduction.
Strings can be anything wrapped in quotes, such as:
c("TRUE",
"7",
"z",
"3.14",
"pooya")
## [1] "TRUE" "7" "z" "3.14" "pooya"
The stringr package contains built-in datasets we’ll be using as examples today, such as:
head(fruit)
## [1] "apple" "apricot" "avocado" "banana" "bell pepper"
## [6] "bilberry"
head(sentences)
## [1] "The birch canoe slid on the smooth planks."
## [2] "Glue the sheet to the dark blue background."
## [3] "It's easy to tell the depth of a well."
## [4] "These days a chicken leg is a rare dish."
## [5] "Rice is often served in round bowls."
## [6] "The juice of lemons makes fine punch."
head(words)
## [1] "a" "able" "about" "absolute" "accept" "account"
#stringr
str_to_upper(fruit) %>% head()
## [1] "APPLE" "APRICOT" "AVOCADO" "BANANA" "BELL PEPPER"
## [6] "BILBERRY"
str_to_lower(fruit) %>% head()
## [1] "apple" "apricot" "avocado" "banana" "bell pepper"
## [6] "bilberry"
#to title case in stringr
head(sentences)
## [1] "The birch canoe slid on the smooth planks."
## [2] "Glue the sheet to the dark blue background."
## [3] "It's easy to tell the depth of a well."
## [4] "These days a chicken leg is a rare dish."
## [5] "Rice is often served in round bowls."
## [6] "The juice of lemons makes fine punch."
str_to_title(sentences) %>% head()
## [1] "The Birch Canoe Slid On The Smooth Planks."
## [2] "Glue The Sheet To The Dark Blue Background."
## [3] "It's Easy To Tell The Depth Of A Well."
## [4] "These Days A Chicken Leg Is A Rare Dish."
## [5] "Rice Is Often Served In Round Bowls."
## [6] "The Juice Of Lemons Makes Fine Punch."
#base
toupper(fruit) %>% head()
## [1] "APPLE" "APRICOT" "AVOCADO" "BANANA" "BELL PEPPER"
## [6] "BILBERRY"
tolower(fruit) %>% head()
## [1] "apple" "apricot" "avocado" "banana" "bell pepper"
## [6] "bilberry"
#to title case in base, uses the pre-installed tools package
tools::toTitleCase(sentences) %>% head()
## [1] "The Birch Canoe Slid on the Smooth Planks."
## [2] "Glue the Sheet to the Dark Blue Background."
## [3] "It's Easy to Tell the Depth of a Well."
## [4] "These Days a Chicken Leg is a Rare Dish."
## [5] "Rice is Often Served in Round Bowls."
## [6] "The Juice of Lemons Makes Fine Punch."
#stringr
str_c("red", "apple")
## [1] "redapple"
str_c("red", "apple", sep = " ")
## [1] "red apple"
str_c("red", "apple", sep = " : ")
## [1] "red : apple"
#base
paste0("red", "apple")
## [1] "redapple"
paste("red", "apple")
## [1] "red apple"
paste("red", "apple", sep = " : ")
## [1] "red : apple"
fruit[1:3]
## [1] "apple" "apricot" "avocado"
#stringr
str_length(fruit[1:3])
## [1] 5 7 7
#base
nchar(fruit[1:3])
## [1] 5 7 7
Subset strings in stringr
#stringr
fruit[1:5]
## [1] "apple" "apricot" "avocado" "banana" "bell pepper"
#drop the first 2 characters and return the rest (start at character 3)
str_sub(fruit[1:5], 3)
## [1] "ple" "ricot" "ocado" "nana" "ll pepper"
#only return characters 3 to 6 (spaces count as characters)
str_sub(fruit[1:5], 3, 6)
## [1] "ple" "rico" "ocad" "nana" "ll p"
nchar(fruit[1:5])
## [1] 5 7 7 6 11
#return the last 3 characters
str_sub(fruit[1:5], -3)
## [1] "ple" "cot" "ado" "ana" "per"
Subset strings in base
#in base R, you must provide a stop argument
substr(fruit[1:5], 3, nchar(fruit[1:5]))
## [1] "ple" "ricot" "ocado" "nana" "ll pepper"
#this will generate an error because the stop argument is missing
#substr(fruit[1:5], 3)
#the stop argument is 6
substr(fruit[1:5], 3, 6)
## [1] "ple" "rico" "ocad" "nana" "ll p"
#start at (number of characters - 2) and return through the end of the string
substr(fruit[1:5], nchar(fruit[1:5]) - 2, nchar(fruit[1:5]))
## [1] "ple" "cot" "ado" "ana" "per"
Use str_sub to modify strings
fruit[1:5]
## [1] "apple" "apricot" "avocado" "banana" "bell pepper"
#replace the second through fourth characters with XX
replaced_fruit <- fruit[1:5]
str_sub(replaced_fruit, 2, 4) <- "XX"
replaced_fruit
## [1] "aXXe" "aXXcot" "aXXado" "bXXna" "bXX pepper"
#notice that we subset 3 characters but replaced them with only 2
#find the start and end position of the first match in each string
fruit[c(1:5, 60:65)] %>%
str_locate("ap")
## start end
## [1,] 1 2
## [2,] 1 2
## [3,] NA NA
## [4,] NA NA
## [5,] NA NA
## [6,] NA NA
## [7,] NA NA
## [8,] 5 6
## [9,] NA NA
## [10,] NA NA
## [11,] NA NA
fruit[c(1:5, 60:65)] %>%
str_locate("apple")
## start end
## [1,] 1 5
## [2,] NA NA
## [3,] NA NA
## [4,] NA NA
## [5,] NA NA
## [6,] NA NA
## [7,] NA NA
## [8,] 5 9
## [9,] NA NA
## [10,] NA NA
## [11,] NA NA
white_space <- c(" before", "after ", " both ")
white_space
## [1] " before" "after " " both "
str_trim(white_space)
## [1] "before" "after" "both"
pad_space <- c("abc", "acbdefg")
#the second argument (10) is the total number of characters you want in the string; anything shorter is padded with spaces until it reaches that width
str_pad(pad_space, 10)
## [1] " abc" " acbdefg"
str_pad(pad_space, 10, side = "right")
## [1] "abc " "acbdefg "
str_pad(pad_space, 10, side = "both")
## [1] " abc " " acbdefg "
string_nums <- as.character(1:15)
string_nums
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [15] "15"
str_pad(string_nums, 3, pad = "0")
## [1] "001" "002" "003" "004" "005" "006" "007" "008" "009" "010" "011"
## [12] "012" "013" "014" "015"
A regular expression is an “instruction” (or pattern) given to a function describing what and how to match or replace in strings (Eden, 2007).
We will briefly touch on some regular expressions with examples and provide resources for reference at the end of the presentation.
stringr provides pattern matching functions to detect, locate, extract, match, replace, and split strings.
from Handling and Processing Strings in R, Sanchez, 2013
Regular expressions can be used to:
identify a match to a pattern: grep(..., value = FALSE), grepl(), stringr::str_detect()
extract a match to a pattern: grep(..., value = TRUE), stringr::str_extract(), stringr::str_extract_all()
locate a pattern within a string (e.g., give the start position of matched patterns): regexpr(), gregexpr(), stringr::str_locate(), stringr::str_locate_all()
replace a pattern: sub(), gsub(), stringr::str_replace(), stringr::str_replace_all()
split a string using a pattern: strsplit(), stringr::str_split()
A quick side-by-side sketch of these verbs follows below.
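To make those concrete, here is a minimal sketch on a toy vector (hypothetical data; expected results are noted in the comments):
library(stringr)
x <- c("apple pie", "banana", "cherry tart")
str_detect(x, "an")          #identify: FALSE TRUE FALSE
str_extract(x, "\\w+$")      #extract the last word: "pie" "banana" "tart"
str_replace_all(x, "a", "A") #replace every a: "Apple pie" "bAnAnA" "cherry tArt"
str_split(x, " ")            #split on the space; returns a list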
A helpful function when learning to locate patterns is str_view, which allows you to see what R is doing behind the scenes. This requires the htmlwidgets package.
str_view(sentences[1:5], "the")
str_view(fruit[1:5], "ap")
grep(pattern, x) searches for a particular pattern in each element of a vector x.
#let's find the location of all the sentences with the word "boy"
grep("boy", sentences)
## [1] 11 25 423 591 634 663 708
sentences[c(11, 423)]
## [1] "The boy was there when the sun rose."
## [2] "The boy owed his pal thirty cents."
#this will return all sentences that contain the string "red"
str_subset(sentences, "red")
## [1] "The colt reared and threw the tall rider."
## [2] "The wide road shimmered in the hot sun."
## [3] "See the cat glaring at the scared mouse."
## [4] "He ordered peach pie with ice cream."
## [5] "Pure bred poodles have curls."
## [6] "Mud was spattered on the front of his white shirt."
## [7] "The sofa cushion is red and of light weight."
## [8] "Torn scraps littered the stone floor."
## [9] "The doctor cured him with these pills."
## [10] "The new girl was fired today at noon."
## [11] "The third act was dull and tired the players."
## [12] "Lire wires should be kept covered."
## [13] "It is hard to erase blue or red ink."
## [14] "The wreck occurred by the bank on Main Street."
## [15] "The box is held by a bright red snapper."
## [16] "The prince ordered his head chopped off."
## [17] "The houses are built of red clay bricks."
## [18] "The red tape bound the smuggled food."
## [19] "Nine men were hired to dig the ruins."
## [20] "The flint sputtered and lit a pine torch."
## [21] "The old pan was covered with hard fudge."
## [22] "The store walls were lined with colored frocks."
## [23] "The clan gathered on each dull night."
## [24] "The lake sparkled in the red hot sun."
## [25] "Mark the spot with a sign painted red."
## [26] "Smoke poured out of every crack."
## [27] "Serve the hot rum to the tired heroes."
## [28] "He offered proof in the form of a lsrge chart."
## [29] "The sip of tea revives his tired friend."
## [30] "The door was barred, locked, and bolted as well."
## [31] "A thick coat of black paint covered all."
## [32] "The small red neon lamp went out."
## [33] "The green light in the brown box flickered."
## [34] "He put his last cartridge into the gun and fired."
## [35] "The ram scared the school children off."
## [36] "Dimes showered down from all sides."
## [37] "The sky in the west is tinged with orange red."
## [38] "The red paper brightened the dim stage."
## [39] "The hail pattered on the burnt brown grass."
## [40] "The big red apple fell to the ground."
#value = TRUE returns the matching elements themselves; with the default value = FALSE, grep returns their positions
grep("red", sentences, value = TRUE)
## [1] "The colt reared and threw the tall rider."
## [2] "The wide road shimmered in the hot sun."
## [3] "See the cat glaring at the scared mouse."
## [4] "He ordered peach pie with ice cream."
## [5] "Pure bred poodles have curls."
## [6] "Mud was spattered on the front of his white shirt."
## [7] "The sofa cushion is red and of light weight."
## [8] "Torn scraps littered the stone floor."
## [9] "The doctor cured him with these pills."
## [10] "The new girl was fired today at noon."
## [11] "The third act was dull and tired the players."
## [12] "Lire wires should be kept covered."
## [13] "It is hard to erase blue or red ink."
## [14] "The wreck occurred by the bank on Main Street."
## [15] "The box is held by a bright red snapper."
## [16] "The prince ordered his head chopped off."
## [17] "The houses are built of red clay bricks."
## [18] "The red tape bound the smuggled food."
## [19] "Nine men were hired to dig the ruins."
## [20] "The flint sputtered and lit a pine torch."
## [21] "The old pan was covered with hard fudge."
## [22] "The store walls were lined with colored frocks."
## [23] "The clan gathered on each dull night."
## [24] "The lake sparkled in the red hot sun."
## [25] "Mark the spot with a sign painted red."
## [26] "Smoke poured out of every crack."
## [27] "Serve the hot rum to the tired heroes."
## [28] "He offered proof in the form of a lsrge chart."
## [29] "The sip of tea revives his tired friend."
## [30] "The door was barred, locked, and bolted as well."
## [31] "A thick coat of black paint covered all."
## [32] "The small red neon lamp went out."
## [33] "The green light in the brown box flickered."
## [34] "He put his last cartridge into the gun and fired."
## [35] "The ram scared the school children off."
## [36] "Dimes showered down from all sides."
## [37] "The sky in the west is tinged with orange red."
## [38] "The red paper brightened the dim stage."
## [39] "The hail pattered on the burnt brown grass."
## [40] "The big red apple fell to the ground."
#regular expression \\w matches a word character
#\\W matches a non-word character
str_subset(sentences, "\\w red")
## [1] "The sofa cushion is red and of light weight."
## [2] "It is hard to erase blue or red ink."
## [3] "The box is held by a bright red snapper."
## [4] "The houses are built of red clay bricks."
## [5] "The red tape bound the smuggled food."
## [6] "The lake sparkled in the red hot sun."
## [7] "Mark the spot with a sign painted red."
## [8] "The small red neon lamp went out."
## [9] "The sky in the west is tinged with orange red."
## [10] "The red paper brightened the dim stage."
## [11] "The big red apple fell to the ground."
#returns the number of "the" matches in each of the first 100 sentences
str_count(sentences, "the")[1:100]
## [1] 1 2 1 0 0 0 1 0 0 0 2 0 2 1 1 1 0 1 1 1 1 1 2 0 2 1 0 1 1 0 1 2 2 1 1
## [36] 0 0 0 1 1 2 1 1 1 1 2 1 1 1 0 1 1 0 1 0 0 2 0 1 2 2 1 1 1 1 0 1 1 2 2
## [71] 2 1 1 0 0 0 2 0 0 0 1 2 0 0 0 1 1 0 1 0 1 1 0 1 0 1 0 0 0 0
#returns TRUE/FALSE, which makes it useful inside filter()
str_detect(sentences, "the")[1:100]
## [1] TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## [12] FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE
## [34] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [45] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE
## [56] FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [67] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
## [78] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
## [89] TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
## [100] FALSE
grepl("the", sentences)[1:100]
## [1] TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## [12] FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE
## [34] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [45] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE
## [56] FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [67] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
## [78] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
## [89] TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
## [100] FALSE
An example that uses filter() with str_detect():
#create a hypothetical dataset
P_ID <- 1:50
num_fruit <- abs(round(rnorm(50, 3, 2), 0))
type_fruit <- fruit[1:50]
fruit_eaters <- as.data.frame(cbind(P_ID, num_fruit, type_fruit)) %>%
mutate(type_fruit = as.character(type_fruit))
head(fruit_eaters)
## P_ID num_fruit type_fruit
## 1 1 1 apple
## 2 2 2 apricot
## 3 3 5 avocado
## 4 4 4 banana
## 5 5 2 bell pepper
## 6 6 1 bilberry
#filter for berry eaters
berry_eaters <- filter(fruit_eaters, str_detect(type_fruit, "berry"))
berry_eaters
## P_ID num_fruit type_fruit
## 1 6 1 bilberry
## 2 7 1 blackberry
## 3 10 4 blueberry
## 4 11 1 boysenberry
## 5 19 1 cloudberry
## 6 21 3 cranberry
## 7 29 3 elderberry
## 8 32 5 goji berry
## 9 33 2 gooseberry
## 10 38 3 huckleberry
## 11 50 5 mulberry
Metacharacters are special characters that define specific operations.
Regular expressions typically specify characters (or character classes) to seek out, possibly with information about repeats and location within the string. This is accomplished with the help of metacharacters that have specific meanings: $ * + . ? [ ] ^ { } | ( ) \. To match one of these literally, escape it, as sketched below.
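A minimal sketch of matching a metacharacter literally by escaping it with a double backslash (the prices vector is made up):
prices <- c("$5.00", "5 dollars", "$12.99")
str_subset(prices, "\\$")      #escaped $ matches a literal dollar sign: "$5.00" "$12.99"
str_detect(prices, fixed("$")) #fixed() treats the whole pattern as literal: TRUE FALSE TRUE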
Quantifier Matching
Specify how many repetitions of the pattern to match.
* : matches at least 0 times.
+ : matches at least 1 time.
? : matches at most 1 time.
{n} : matches exactly n times.
{n,} : matches at least n times.
{n,m} : matches between n and m times.
strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")
#we are applying the quantifier to c
#return everything containing a and b
grep("ac*b", strings, value = TRUE) #any number of c's, including zero
## [1] "ab" "acb" "accb" "acccb" "accccb"
grep("ac+b", strings, value = TRUE) #with at least 1 c
## [1] "acb" "accb" "acccb" "accccb"
grep("ac?b", strings, value = TRUE) #with at most 1 c
## [1] "ab" "acb"
grep("ac{2}b", strings, value = TRUE) #with exactly 2 c
## [1] "accb"
Position Matching
Matches the position of a pattern within the string.
^ : matches the start of the string.
$ : matches the end of the string.
\b : matches the empty string at either edge of a word. Don’t confuse it with ^ and $, which mark the edges of a string.
\B : matches the empty string provided it is not at an edge of a word. (A short sketch of \b and \B follows below.)
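A short sketch of \b and \B, as promised (hypothetical vector; expected results in the comments):
word_vec <- c("red", "hired", "redo")
grepl("\\bred\\b", word_vec) #"red" as a whole word: TRUE FALSE FALSE
grepl("red\\B", word_vec)    #"red" continuing into a word: FALSE FALSE TRUE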
itemIDs <- c("RF3L02E03", "RF3M08E05", "RF3H10E08", "RL1L03E03", "RL1M05E05", "RL1H10E04", "RI2L03E07", "RI2M06E05", "RI2HSAMPLEE06", "WR4L02E03", "WR4M06E06", "WR4H09E03", "WR9L03E04", "WR9M08E04", "WR9H12E04", "LA1L01E11", "LA1M06E04", "LA1H09E04", "WR2L03E05", "WR2M06E05", "WR2H10E05", "LA2LSAMPLEE03", "LA2M06E04", "LA2H09E08", "RF4L02E03", "RF4M08E06", "RF4H09E07", "RL7L02E07", "RL7M06E06", "RL7H10E06", "RI1L02E07", "RI1M07E08", "RI1H11E07", "WR1L02E07", "WR1M07E07", "WR1H11E08")
# select items that end with 4 using regex
str_subset(itemIDs, "4$")
## [1] "RL1H10E04" "WR9L03E04" "WR9M08E04" "WR9H12E04" "LA1M06E04" "LA1H09E04"
## [7] "LA2M06E04"
Match the start of a string
#select items starting with RF
str_subset(itemIDs, "^RF")
## [1] "RF3L02E03" "RF3M08E05" "RF3H10E08" "RF4L02E03" "RF4M08E06" "RF4H09E07"
#select items starting with WR
str_subset(itemIDs, "^WR")
## [1] "WR4L02E03" "WR4M06E06" "WR4H09E03" "WR9L03E04" "WR9M08E04"
## [6] "WR9H12E04" "WR2L03E05" "WR2M06E05" "WR2H10E05" "WR1L02E07"
## [11] "WR1M07E07" "WR1H11E08"
It is useful to include the ^ metacharacter in case the same pattern also matches somewhere other than the beginning of the string.
from Sanchez, 2013
digits <- c("Charlie", "Charlie2", "Mary", "Marianne", "Mary2", "15")
#match a digit character
str_subset(digits, "\\d") #return anything containing digits
## [1] "Charlie2" "Mary2" "15"
#match a non-digit character
str_subset(digits, "\\D")
## [1] "Charlie" "Charlie2" "Mary" "Marianne" "Mary2"
#return anything containing a non-digit character; even a single non-digit makes a string match
string <- c("School is fun. Especially recess. That's the best part. I love recess.")
#match a space character
str_replace_all(string, "\\s", "_") #replaced all spaces with "_"
## [1] "School_is_fun._Especially_recess._That's_the_best_part._I_love_recess."
#match a non space character
str_replace_all(string, "\\S", "_") #replaced all non space with "_"
## [1] "______ __ ____ __________ _______ ______ ___ ____ _____ _ ____ _______"
#match a word character
str_replace_all(string, "\\w", "z") #replace all characters within words with "z"
## [1] "zzzzzz zz zzz. zzzzzzzzzz zzzzzz. zzzz'z zzz zzzz zzzz. z zzzz zzzzzz."
#match a non word character
str_replace_all(string, "\\W", "_") #replace all non-word characters with "_"
## [1] "School_is_fun__Especially_recess__That_s_the_best_part__I_love_recess_"
#this is useful for matching a specific word in str_subset
str_subset(sentences, "red")[1:15]
## [1] "The colt reared and threw the tall rider."
## [2] "The wide road shimmered in the hot sun."
## [3] "See the cat glaring at the scared mouse."
## [4] "He ordered peach pie with ice cream."
## [5] "Pure bred poodles have curls."
## [6] "Mud was spattered on the front of his white shirt."
## [7] "The sofa cushion is red and of light weight."
## [8] "Torn scraps littered the stone floor."
## [9] "The doctor cured him with these pills."
## [10] "The new girl was fired today at noon."
## [11] "The third act was dull and tired the players."
## [12] "Lire wires should be kept covered."
## [13] "It is hard to erase blue or red ink."
## [14] "The wreck occurred by the bank on Main Street."
## [15] "The box is held by a bright red snapper."
str_subset(sentences, "\\w red")
## [1] "The sofa cushion is red and of light weight."
## [2] "It is hard to erase blue or red ink."
## [3] "The box is held by a bright red snapper."
## [4] "The houses are built of red clay bricks."
## [5] "The red tape bound the smuggled food."
## [6] "The lake sparkled in the red hot sun."
## [7] "Mark the spot with a sign painted red."
## [8] "The small red neon lamp went out."
## [9] "The sky in the west is tinged with orange red."
## [10] "The red paper brightened the dim stage."
## [11] "The big red apple fell to the ground."
We will cover basic text processing using the tidytext package. However, many other packages exist, such as quanteda.
Examples are from Text Mining with R: A Tidy Approach.
Example with some text from Emily Dickinson.
text <- c("Because I could not stop for Death -",
"He kindly stopped for me -",
"The Carriage held but just Ourselves -",
"and Immortality")
text
## [1] "Because I could not stop for Death -"
## [2] "He kindly stopped for me -"
## [3] "The Carriage held but just Ourselves -"
## [4] "and Immortality"
#and convert it into a dataframe
text_df <- data_frame(line = 1:4, text = text)
text_df
## # A tibble: 4 x 2
## line text
## <int> <chr>
## 1 1 Because I could not stop for Death -
## 2 2 He kindly stopped for me -
## 3 3 The Carriage held but just Ourselves -
## 4 4 and Immortality
The tidytext package follows tidy data principles:
This means the tidytext format is a table with one token per row. We need to tokenize (i.e., split into individual words) the text data. This is done using unnest_tokens in tidytext.
unnest_tokens(output, input, token = "words")
output: Output column to be created as string or symbol.
input: Input column that gets split as string or symbol.
token: Unit for tokenizing, or a custom tokenizing function. Built-in options are “words” (default), “characters”, “character_shingles”, “ngrams”, “skip_ngrams”, “sentences”, “lines”, “paragraphs”, and “regex”.
text_df %>%
unnest_tokens(word, text)
## # A tibble: 20 x 2
## line word
## <int> <chr>
## 1 1 because
## 2 1 i
## 3 1 could
## 4 1 not
## 5 1 stop
## 6 1 for
## 7 1 death
## 8 2 he
## 9 2 kindly
## 10 2 stopped
## 11 2 for
## 12 2 me
## 13 3 the
## 14 3 carriage
## 15 3 held
## 16 3 but
## 17 3 just
## 18 3 ourselves
## 19 4 and
## 20 4 immortality
After running unnest_tokens:
Punctuation has been stripped.
By default, unnest_tokens() converts the tokens to lowercase, which makes them easier to compare or combine with other datasets. (Use the to_lower = FALSE argument to turn off this behavior, as sketched below.)
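A quick sketch of to_lower = FALSE, reusing the text_df from above; tokens keep their original casing (e.g., "Because", "Death"):
text_df %>%
  unnest_tokens(word, text, to_lower = FALSE)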
Now the data is in tidy format, which allows for manipulation with tidyverse packages such as dplyr, tidyr, and ggplot2.
We are going to import Jane Austen’s 6 completed, published novels from the janeaustenr package (Silge 2016) and transform them into a tidy format. The janeaustenr package provides these texts in a one-row-per-line format, where a line in this context is analogous to a literal printed line in a physical book. Let’s start with that, and also use mutate() to annotate a linenumber quantity to keep track of lines in the original format and a chapter (using a regex) to find where all the chapters are.
original_books <- austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup()
original_books
## # A tibble: 73,422 x 4
## text book linenumber chapter
## <chr> <fct> <int> <int>
## 1 SENSE AND SENSIBILITY Sense & Sensibility 1 0
## 2 "" Sense & Sensibility 2 0
## 3 by Jane Austen Sense & Sensibility 3 0
## 4 "" Sense & Sensibility 4 0
## 5 (1811) Sense & Sensibility 5 0
## 6 "" Sense & Sensibility 6 0
## 7 "" Sense & Sensibility 7 0
## 8 "" Sense & Sensibility 8 0
## 9 "" Sense & Sensibility 9 0
## 10 CHAPTER 1 Sense & Sensibility 10 1
## # ... with 73,412 more rows
Now, we need to restructure it in the one-token-per-row format using the unnest_tokens() function.
tidy_books <- original_books %>%
unnest_tokens(word, text)
tidy_books
## # A tibble: 725,055 x 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Sense & Sensibility 1 0 sense
## 2 Sense & Sensibility 1 0 and
## 3 Sense & Sensibility 1 0 sensibility
## 4 Sense & Sensibility 3 0 by
## 5 Sense & Sensibility 3 0 jane
## 6 Sense & Sensibility 3 0 austen
## 7 Sense & Sensibility 5 0 1811
## 8 Sense & Sensibility 10 1 chapter
## 9 Sense & Sensibility 10 1 1
## 10 Sense & Sensibility 13 1 the
## # ... with 725,045 more rows
Often in text analysis, we will want to remove stop words: words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth in English. We can remove stop words (kept in the tidytext dataset stop_words) with anti_join(). anti_join() returns all rows from x where there are no matching values in y, keeping just the columns from x.
You can also customize stop words (example further down).
#if stop words are not applied
tidy_books %>%
count(word, sort = TRUE)
## # A tibble: 14,520 x 2
## word n
## <chr> <int>
## 1 the 26351
## 2 to 24044
## 3 and 22515
## 4 of 21178
## 5 a 13408
## 6 her 13055
## 7 i 12006
## 8 in 11217
## 9 was 11204
## 10 it 10234
## # ... with 14,510 more rows
#see that the most common words are: "the", "to", "and", etc...
#apply the stop words
data(stop_words)
tidy_books <- tidy_books %>%
anti_join(stop_words)
## Joining, by = "word"
We can also use dplyr’s count() to find the most common words in all the books as a whole, and plot them with ggplot2.
#count
tidy_books %>%
count(word, sort = TRUE)
## # A tibble: 13,914 x 2
## word n
## <chr> <int>
## 1 miss 1855
## 2 time 1337
## 3 fanny 862
## 4 dear 822
## 5 lady 817
## 6 sir 806
## 7 day 797
## 8 emma 787
## 9 sister 727
## 10 house 699
## # ... with 13,904 more rows
#ggplot
tidy_books %>%
count(word, sort = TRUE) %>%
filter(n > 600) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
When human readers approach a text, we use our understanding of the emotional intent of words to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust. We can use the tools of text mining to approach the emotional content of text programmatically.
There are a variety of methods and dictionaries for evaluating the opinion or emotion in text. The tidytext package contains several sentiment lexicons in the sentiments dataset.
sample_n(sentiments, size = 12)
## # A tibble: 12 x 4
## word sentiment lexicon score
## <chr> <chr> <chr> <int>
## 1 invincibility positive bing NA
## 2 appreciating <NA> AFINN 2
## 3 fresh <NA> AFINN 1
## 4 recuse litigious loughran NA
## 5 refused <NA> AFINN -2
## 6 bacteria negative nrc NA
## 7 crack negative nrc NA
## 8 congress disgust nrc NA
## 9 successfully positive loughran NA
## 10 fancier positive bing NA
## 11 satirical negative bing NA
## 12 nepotism anger nrc NA
There are three sentiment lexicons in sentiments: AFINN, bing, and nrc, each with certain properties (pick the one that suits your project’s needs).
The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns words a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
To get a specific sentiment lexicon, you can use get_sentiments().
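For example (each lexicon comes back as its own tibble; note that in newer tidytext versions some lexicons may prompt a one-time download):
get_sentiments("bing") %>% head()  #word and sentiment columns
get_sentiments("afinn") %>% head() #word and score columns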
We already imported Jane Austen’s books in the previous section. Let’s perform sentiment analysis on the most common “joy” and “anger” words in the book Emma using inner_join(). inner_join() returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
#joy
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 298 x 2
## word n
## <chr> <int>
## 1 friend 166
## 2 hope 143
## 3 happy 125
## 4 love 117
## 5 deal 92
## 6 found 92
## 7 happiness 76
## 8 pretty 68
## 9 true 66
## 10 comfort 65
## # ... with 288 more rows
#anger
nrc_anger <- get_sentiments("nrc") %>%
filter(sentiment == "anger")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_anger) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 313 x 2
## word n
## <chr> <int>
## 1 ill 72
## 2 bad 60
## 3 feeling 56
## 4 bear 52
## 5 words 49
## 6 obliging 34
## 7 evil 33
## 8 difficulty 30
## 9 spite 24
## 10 loss 23
## # ... with 303 more rows
Combining count() with word and sentiment, we can find out how much each word contributed to each sentiment.
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
bing_word_counts
## # A tibble: 2,555 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 happy positive 534
## 3 love positive 495
## 4 pleasure positive 462
## 5 poor negative 424
## 6 happiness positive 369
## 7 comfort positive 292
## 8 doubt negative 281
## 9 affection positive 272
## 10 perfectly positive 271
## # ... with 2,545 more rows
#now let's plot it
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
## Selecting by n
We can see an anomaly related to the word “miss”: it is coded as negative, but it is used as a title for young, unmarried women in Jane Austen’s works. We can add “miss” to a custom stop-words list using bind_rows():
custom_stop_words <- bind_rows(data_frame(word = c("miss"),
lexicon = c("custom")),
stop_words)
custom_stop_words
## # A tibble: 1,150 x 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # ... with 1,140 more rows
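Applying the custom list works exactly like before; a minimal sketch:
tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(word, sort = TRUE)
#"miss" no longer tops the counts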
Wordclouds can be made using the wordcloud package.
tidy_books %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
Most common positive and negative words in Jane Austen’s novels
#acast() comes from the reshape2 package; comparison.cloud() from wordcloud
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
## Joining, by = "word"
Some sentiment analysis algorithms look beyond only unigrams (i.e. single words) to try to understand the sentiment of a sentence as a whole. These algorithms try to understand that
I am not having a good day.
is a sad sentence, not a happy one, because of negation.
R packages such as coreNLP (T. Arnold and Tilton 2016), cleanNLP (T. B. Arnold 2016), and sentimentr (Rinker 2017) are examples of such sentiment analysis algorithms. For these, we may want to tokenize text into sentences, and it makes sense to use a new name for the output column in such a case.
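For instance, a minimal sketch of sentence-level tokenization with unnest_tokens, giving the output column a new name:
data_frame(line = 1, text = "I am not having a good day. Tomorrow will be better.") %>%
  unnest_tokens(sentence, text, token = "sentences")
#yields one row per sentence in a column named sentence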