Introduction

This presentation gives a brief overview of manipulating character strings and text data. We will cover string manipulation using the stringr package and some functions from base R, along with how regular expressions can be applied to pattern matching. We will also introduce text analysis examples using the tidytext package.

Since we will barely scratch the surface of strings, regular expressions, and text analysis, a resources section is available at the end of the presentation.

Reference

Anderson, D. (2017). EDLD610 - Exploring Data with R

Why?

Even if you don’t frequently conduct text analysis or text mining, character string manipulation is useful for data wrangling:

  • You want to remove a given character in the names of your variables

  • You want to replace a given character in your data

  • You want to convert labels to upper case (or lower case)

  • You want to subset your data based on a specific pattern (e.g., extract phone numbers from a messy survey: 541-346-1234, 541.346.1234, 541-3461234, work: 541-3461234; see the sketch after this list)

  • and more!
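
For instance, the phone-number case above can be handled with one pattern. A minimal sketch, assuming stringr is loaded (the regex syntax used here is covered later in the presentation):

messy <- c("541-346-1234", "541.346.1234", "541-3461234", "work: 541-3461234")
#three digits, an optional "-" or ".", three digits, an optional "-" or ".", four digits
str_extract(messy, "\\d{3}[-.]?\\d{3}[-.]?\\d{4}")
## [1] "541-346-1234" "541.346.1234" "541-3461234"  "541-3461234"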

String Manipulations

We will cover the stringr package and some functions from base R.

The stringr package is part of the tidyverse, but many of the base R functions are common enough to warrant some introduction.

Strings can be anything wrapped in quotes, such as:

c("TRUE",
"7",
"z",
"3.14",
"pooya")
## [1] "TRUE"  "7"     "z"     "3.14"  "pooya"

The stringr package contains built-in datasets that we’ll use as examples today, such as:

head(fruit)
## [1] "apple"       "apricot"     "avocado"     "banana"      "bell pepper"
## [6] "bilberry"
head(sentences)
## [1] "The birch canoe slid on the smooth planks." 
## [2] "Glue the sheet to the dark blue background."
## [3] "It's easy to tell the depth of a well."     
## [4] "These days a chicken leg is a rare dish."   
## [5] "Rice is often served in round bowls."       
## [6] "The juice of lemons makes fine punch."
head(words)
## [1] "a"        "able"     "about"    "absolute" "accept"   "account"

String manipulations in R and stringr

From Handling and Processing Strings in R (Sanchez, 2013)

Upper, lower, & title case

#stringr 
str_to_upper(fruit) %>% head()
## [1] "APPLE"       "APRICOT"     "AVOCADO"     "BANANA"      "BELL PEPPER"
## [6] "BILBERRY"
str_to_lower(fruit) %>% head()
## [1] "apple"       "apricot"     "avocado"     "banana"      "bell pepper"
## [6] "bilberry"
#to title case in stringr
head(sentences)
## [1] "The birch canoe slid on the smooth planks." 
## [2] "Glue the sheet to the dark blue background."
## [3] "It's easy to tell the depth of a well."     
## [4] "These days a chicken leg is a rare dish."   
## [5] "Rice is often served in round bowls."       
## [6] "The juice of lemons makes fine punch."
str_to_title(sentences) %>% head()
## [1] "The Birch Canoe Slid On The Smooth Planks." 
## [2] "Glue The Sheet To The Dark Blue Background."
## [3] "It's Easy To Tell The Depth Of A Well."     
## [4] "These Days A Chicken Leg Is A Rare Dish."   
## [5] "Rice Is Often Served In Round Bowls."       
## [6] "The Juice Of Lemons Makes Fine Punch."
#base 
toupper(fruit) %>% head()
## [1] "APPLE"       "APRICOT"     "AVOCADO"     "BANANA"      "BELL PEPPER"
## [6] "BILBERRY"
tolower(fruit) %>% head()
## [1] "apple"       "apricot"     "avocado"     "banana"      "bell pepper"
## [6] "bilberry"
#to title case in base, uses the pre-installed tools package
tools::toTitleCase(sentences) %>% head()
## [1] "The Birch Canoe Slid on the Smooth Planks." 
## [2] "Glue the Sheet to the Dark Blue Background."
## [3] "It's Easy to Tell the Depth of a Well."     
## [4] "These Days a Chicken Leg is a Rare Dish."   
## [5] "Rice is Often Served in Round Bowls."       
## [6] "The Juice of Lemons Makes Fine Punch."

Joining strings

#stringr
str_c("red", "apple")
## [1] "redapple"
str_c("red", "apple", sep = " ")
## [1] "red apple"
str_c("red", "apple", sep = " : ")
## [1] "red : apple"
#base 
paste0("red", "apple")
## [1] "redapple"
paste("red", "apple")
## [1] "red apple"
paste("red", "apple", sep = " : ")
## [1] "red : apple"

String length

fruit[1:3]
## [1] "apple"   "apricot" "avocado"
#stringr
str_length(fruit[1:3])
## [1] 5 7 7
#base
nchar(fruit[1:3])
## [1] 5 7 7

Subset strings

Subset strings in stringr

#stringr 

fruit[1:5]
## [1] "apple"       "apricot"     "avocado"     "banana"      "bell pepper"
#drop the first 2 characters and return the rest (start at character 3)
str_sub(fruit[1:5], 3)
## [1] "ple"       "ricot"     "ocado"     "nana"      "ll pepper"
#only return characters 3 to 6 (spaces count as characters)
str_sub(fruit[1:5], 3, 6)
## [1] "ple"  "rico" "ocad" "nana" "ll p"
nchar(fruit[1:5])
## [1]  5  7  7  6 11
#return the last 3 characters
str_sub(fruit[1:5], -3)
## [1] "ple" "cot" "ado" "ana" "per"

Subset strings in base

#in base R, you must provide a stop argument 
substr(fruit[1:5], 3, nchar(fruit[1:5])) 
## [1] "ple"       "ricot"     "ocado"     "nana"      "ll pepper"
#this will generate an error
#substr(fruit[1:5], 3)

#the stop argument is 6
substr(fruit[1:5], 3, 6)
## [1] "ple"  "rico" "ocad" "nana" "ll p"
#start 2 characters before the end and return through the end of the string (i.e., the last 3 characters)
substr(fruit[1:5], nchar(fruit[1:5]) - 2, nchar(fruit[1:5]))
## [1] "ple" "cot" "ado" "ana" "per"

Modify strings

Use str_sub to modify strings

fruit[1:5]
## [1] "apple"       "apricot"     "avocado"     "banana"      "bell pepper"
#replace the second through fourth characters with "XX"
replaced_fruit <- fruit[1:5]
str_sub(replaced_fruit, 2, 4) <- "XX" 
replaced_fruit
## [1] "aXXe"       "aXXcot"     "aXXado"     "bXXna"      "bXX pepper"
#notice that we selected a 3-character range but replaced it with only 2 characters, so the strings got shorter

Locate where strings occur

fruit[c(1:5, 60:65)] %>% 
  str_locate("ap")
##       start end
##  [1,]     1   2
##  [2,]     1   2
##  [3,]    NA  NA
##  [4,]    NA  NA
##  [5,]    NA  NA
##  [6,]    NA  NA
##  [7,]    NA  NA
##  [8,]     5   6
##  [9,]    NA  NA
## [10,]    NA  NA
## [11,]    NA  NA
fruit[c(1:5, 60:65)] %>% 
  str_locate("apple")
##       start end
##  [1,]     1   5
##  [2,]    NA  NA
##  [3,]    NA  NA
##  [4,]    NA  NA
##  [5,]    NA  NA
##  [6,]    NA  NA
##  [7,]    NA  NA
##  [8,]     5   9
##  [9,]    NA  NA
## [10,]    NA  NA
## [11,]    NA  NA
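
str_locate() only reports the first match in each string. To get every match, use str_locate_all(), which returns a list:

str_locate_all("banana", "an")
## [[1]]
##      start end
## [1,]     2   3
## [2,]     4   5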

Trim and pad white space

white_space <- c(" before", "after "," both ")
white_space
## [1] " before" "after "  " both "
str_trim(white_space)
## [1] "before" "after"  "both"
pad_space <- c("abc", "acbdefg")
#10 is the target width: strings shorter than 10 characters are padded with spaces until they reach it
str_pad(pad_space, 10)
## [1] "       abc" "   acbdefg"
str_pad(pad_space, 10, side = "right")
## [1] "abc       " "acbdefg   "
str_pad(pad_space, 10, side = "both")
## [1] "   abc    " " acbdefg  "

Pad with something else

string_nums <- as.character(1:15)
string_nums
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [15] "15"
str_pad(string_nums, 3, pad = "0")
##  [1] "001" "002" "003" "004" "005" "006" "007" "008" "009" "010" "011"
## [12] "012" "013" "014" "015"

Regular Expression

A regular expression is an “instruction” (or pattern) given to a function specifying what to match or replace in strings (Eden, 2007).

We will briefly touch upon some regular expression with examples and provide resources for references at the end of the presentation.

stringr provides pattern matching functions to detect, locate, extract, match, replace, and split strings.

From Handling and Processing Strings in R (Sanchez, 2013)

Regular expressions can be used to (each verb is demonstrated briefly after this list):

  • identify match to a pattern: grep(..., value = FALSE), grepl(), stringr::str_detect()

  • extract match to a pattern: grep(..., value = TRUE), stringr::str_extract(), stringr::str_extract_all()

  • locate a pattern within a string (e.g., give the start position of matched patterns): regexpr(), gregexpr(), stringr::str_locate(), stringr::str_locate_all()

  • replace a pattern: sub(), gsub(), stringr::str_replace(), stringr::str_replace_all()

  • split a string using a pattern: strsplit(), stringr::str_split()
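
A minimal sketch of the stringr verbs on a toy vector:

x <- c("apple pie", "apple", "cherry pie")
str_detect(x, "pie")           #identify
## [1]  TRUE FALSE  TRUE
str_extract(x, "pie")          #extract
## [1] "pie" NA    "pie"
str_replace(x, "pie", "tart")  #replace
## [1] "apple tart"  "apple"       "cherry tart"
str_split(x, " ")              #split
## [[1]]
## [1] "apple" "pie"
## 
## [[2]]
## [1] "apple"
## 
## [[3]]
## [1] "cherry" "pie"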

Locate pattern

A helpful function when learning to locate patterns is str_view(), which lets you see what R is matching behind the scenes. It requires the htmlwidgets package.

str_view(sentences[1:5], "the")
str_view(fruit[1:5], "ap")

Search for a particular pattern in each element of a vector x with grep(pattern, x):

#let's find the location of all the sentences with the word "boy"
grep("boy", sentences)
## [1]  11  25 423 591 634 663 708
sentences[c(11, 423)]
## [1] "The boy was there when the sun rose."
## [2] "The boy owed his pal thirty cents."

Extract sentence

#this will return all sentences that contains the string "red"
str_subset(sentences, "red")
##  [1] "The colt reared and threw the tall rider."         
##  [2] "The wide road shimmered in the hot sun."           
##  [3] "See the cat glaring at the scared mouse."          
##  [4] "He ordered peach pie with ice cream."              
##  [5] "Pure bred poodles have curls."                     
##  [6] "Mud was spattered on the front of his white shirt."
##  [7] "The sofa cushion is red and of light weight."      
##  [8] "Torn scraps littered the stone floor."             
##  [9] "The doctor cured him with these pills."            
## [10] "The new girl was fired today at noon."             
## [11] "The third act was dull and tired the players."     
## [12] "Lire wires should be kept covered."                
## [13] "It is hard to erase blue or red ink."              
## [14] "The wreck occurred by the bank on Main Street."    
## [15] "The box is held by a bright red snapper."          
## [16] "The prince ordered his head chopped off."          
## [17] "The houses are built of red clay bricks."          
## [18] "The red tape bound the smuggled food."             
## [19] "Nine men were hired to dig the ruins."             
## [20] "The flint sputtered and lit a pine torch."         
## [21] "The old pan was covered with hard fudge."          
## [22] "The store walls were lined with colored frocks."   
## [23] "The clan gathered on each dull night."             
## [24] "The lake sparkled in the red hot sun."             
## [25] "Mark the spot with a sign painted red."            
## [26] "Smoke poured out of every crack."                  
## [27] "Serve the hot rum to the tired heroes."            
## [28] "He offered proof in the form of a lsrge chart."    
## [29] "The sip of tea revives his tired friend."          
## [30] "The door was barred, locked, and bolted as well."  
## [31] "A thick coat of black paint covered all."          
## [32] "The small red neon lamp went out."                 
## [33] "The green light in the brown box flickered."       
## [34] "He put his last cartridge into the gun and fired." 
## [35] "The ram scared the school children off."           
## [36] "Dimes showered down from all sides."               
## [37] "The sky in the west is tinged with orange red."    
## [38] "The red paper brightened the dim stage."           
## [39] "The hail pattered on the burnt brown grass."       
## [40] "The big red apple fell to the ground."
#value = TRUE returns the matching elements themselves; the default, value = FALSE, returns their positions
grep("red", sentences, value = TRUE)
##  [1] "The colt reared and threw the tall rider."         
##  [2] "The wide road shimmered in the hot sun."           
##  [3] "See the cat glaring at the scared mouse."          
##  [4] "He ordered peach pie with ice cream."              
##  [5] "Pure bred poodles have curls."                     
##  [6] "Mud was spattered on the front of his white shirt."
##  [7] "The sofa cushion is red and of light weight."      
##  [8] "Torn scraps littered the stone floor."             
##  [9] "The doctor cured him with these pills."            
## [10] "The new girl was fired today at noon."             
## [11] "The third act was dull and tired the players."     
## [12] "Lire wires should be kept covered."                
## [13] "It is hard to erase blue or red ink."              
## [14] "The wreck occurred by the bank on Main Street."    
## [15] "The box is held by a bright red snapper."          
## [16] "The prince ordered his head chopped off."          
## [17] "The houses are built of red clay bricks."          
## [18] "The red tape bound the smuggled food."             
## [19] "Nine men were hired to dig the ruins."             
## [20] "The flint sputtered and lit a pine torch."         
## [21] "The old pan was covered with hard fudge."          
## [22] "The store walls were lined with colored frocks."   
## [23] "The clan gathered on each dull night."             
## [24] "The lake sparkled in the red hot sun."             
## [25] "Mark the spot with a sign painted red."            
## [26] "Smoke poured out of every crack."                  
## [27] "Serve the hot rum to the tired heroes."            
## [28] "He offered proof in the form of a lsrge chart."    
## [29] "The sip of tea revives his tired friend."          
## [30] "The door was barred, locked, and bolted as well."  
## [31] "A thick coat of black paint covered all."          
## [32] "The small red neon lamp went out."                 
## [33] "The green light in the brown box flickered."       
## [34] "He put his last cartridge into the gun and fired." 
## [35] "The ram scared the school children off."           
## [36] "Dimes showered down from all sides."               
## [37] "The sky in the west is tinged with orange red."    
## [38] "The red paper brightened the dim stage."           
## [39] "The hail pattered on the burnt brown grass."       
## [40] "The big red apple fell to the ground."
#with regular expressions: \\w matches a word character,
#\\W matches a non-word character 
str_subset(sentences, "\\w red")
##  [1] "The sofa cushion is red and of light weight."  
##  [2] "It is hard to erase blue or red ink."          
##  [3] "The box is held by a bright red snapper."      
##  [4] "The houses are built of red clay bricks."      
##  [5] "The red tape bound the smuggled food."         
##  [6] "The lake sparkled in the red hot sun."         
##  [7] "Mark the spot with a sign painted red."        
##  [8] "The small red neon lamp went out."             
##  [9] "The sky in the west is tinged with orange red."
## [10] "The red paper brightened the dim stage."       
## [11] "The big red apple fell to the ground."

Count occurrences

#count how many times "the" appears in each sentence (showing the first 100 sentences)
str_count(sentences, "the")[1:100]
##   [1] 1 2 1 0 0 0 1 0 0 0 2 0 2 1 1 1 0 1 1 1 1 1 2 0 2 1 0 1 1 0 1 2 2 1 1
##  [36] 0 0 0 1 1 2 1 1 1 1 2 1 1 1 0 1 1 0 1 0 0 2 0 1 2 2 1 1 1 1 0 1 1 2 2
##  [71] 2 1 1 0 0 0 2 0 0 0 1 2 0 0 0 1 1 0 1 0 1 1 0 1 0 1 0 0 0 0

Logical tests

#good to use in filter
str_detect(sentences, "the")[1:100]
##   [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
##  [12] FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [23]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
##  [34]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [45]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE
##  [56] FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
##  [67]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
##  [78] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
##  [89]  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
## [100] FALSE
grepl("the", sentences)[1:100]
##   [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
##  [12] FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [23]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
##  [34]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [45]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE
##  [56] FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
##  [67]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
##  [78] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
##  [89]  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
## [100] FALSE

Filter with str_detect

An example that uses filter with str_detect

#create a hypothetical dataset 
P_ID <- 1:50
num_fruit <- abs(round(rnorm(50, 3, 2),0))
type_fruit <- fruit[1:50]
fruit_eaters <- as.data.frame(cbind(P_ID, num_fruit, type_fruit)) %>% 
  mutate(type_fruit = as.character(type_fruit))
head(fruit_eaters)
##   P_ID num_fruit  type_fruit
## 1    1         1       apple
## 2    2         2     apricot
## 3    3         5     avocado
## 4    4         4      banana
## 5    5         2 bell pepper
## 6    6         1    bilberry
#filter for berry eaters
berry_eaters <- filter(fruit_eaters, str_detect(type_fruit, "berry"))
berry_eaters
##    P_ID num_fruit  type_fruit
## 1     6         1    bilberry
## 2     7         1  blackberry
## 3    10         4   blueberry
## 4    11         1 boysenberry
## 5    19         1  cloudberry
## 6    21         3   cranberry
## 7    29         3  elderberry
## 8    32         5  goji berry
## 9    33         2  gooseberry
## 10   38         3 huckleberry
## 11   50         5    mulberry

Metacharacters are special characters that define specific operations

Regular expressions typically specify characters (or character classes) to seek out, possibly with information about repeats and location within the string. This is accomplished with the help of metacharacters that have specific meaning: $ * + . ? [ ] ^ { } | ( ) \.
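
To match a metacharacter literally, escape it with a double backslash, or wrap the pattern in stringr's fixed():

#"." is a metacharacter that matches any character
str_detect(c("3.14", "3014"), ".")
## [1] TRUE TRUE
#"\\." matches only a literal dot
str_detect(c("3.14", "3014"), "\\.")
## [1]  TRUE FALSE
#fixed() treats the whole pattern literally
str_detect(c("3.14", "3014"), fixed("."))
## [1]  TRUE FALSE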

Quantifier Matching

Specify how many repetitions of the pattern.

*: matches at least 0 times.

+: matches at least 1 time.

?: matches at most 1 time.

{n}: matches exactly n times.

{n,}: matches at least n times.

{n,m}: matches between n and m times.

Quantifier matching

strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")

#we are applying the quantifier to c
#return every string containing an a followed by a b
grep("ac*b", strings, value = TRUE) #any number of c's in between, including zero
## [1] "ab"     "acb"    "accb"   "acccb"  "accccb"
grep("ac+b", strings, value = TRUE) #with at least 1 c 
## [1] "acb"    "accb"   "acccb"  "accccb"
grep("ac?b", strings, value = TRUE) #with at most 1 c
## [1] "ab"  "acb"
grep("ac{2}b", strings, value = TRUE) #with exactly 2 c 
## [1] "accb"

Position Matching

Matches position of pattern within the string

^: matches the start of the string.

$: matches the end of the string.

\b: matches the empty string at either edge of a word. Don’t confuse it with ^ and $, which mark the edges of a string (see the sketch after this list).

\B: matches the empty string provided it is not at an edge of a word.
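
A small sketch of \b on a toy vector: the pattern "\\bred\\b" matches "red" only where it stands alone as a word:

words_vec <- c("red", "tired", "covered", "red ink")
str_subset(words_vec, "red")       #substring match
## [1] "red"     "tired"   "covered" "red ink"
str_subset(words_vec, "\\bred\\b") #whole-word match
## [1] "red"     "red ink"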

Position Matching

itemIDs <- c("RF3L02E03", "RF3M08E05", "RF3H10E08", "RL1L03E03", "RL1M05E05", "RL1H10E04", "RI2L03E07", "RI2M06E05", "RI2HSAMPLEE06", "WR4L02E03", "WR4M06E06", "WR4H09E03", "WR9L03E04", "WR9M08E04", "WR9H12E04", "LA1L01E11", "LA1M06E04", "LA1H09E04", "WR2L03E05", "WR2M06E05", "WR2H10E05", "LA2LSAMPLEE03", "LA2M06E04", "LA2H09E08", "RF4L02E03", "RF4M08E06", "RF4H09E07", "RL7L02E07", "RL7M06E06", "RL7H10E06", "RI1L02E07", "RI1M07E08", "RI1H11E07", "WR1L02E07", "WR1M07E07", "WR1H11E08")
# select items that end with 4 using a regex
str_subset(itemIDs, "4$")
## [1] "RL1H10E04" "WR9L03E04" "WR9M08E04" "WR9H12E04" "LA1M06E04" "LA1H09E04"
## [7] "LA2M06E04"

Match the start of a string

#select items starting with RF
str_subset(itemIDs, "^RF")
## [1] "RF3L02E03" "RF3M08E05" "RF3H10E08" "RF4L02E03" "RF4M08E06" "RF4H09E07"
#select items starting with WR
str_subset(itemIDs, "^WR")
##  [1] "WR4L02E03" "WR4M06E06" "WR4H09E03" "WR9L03E04" "WR9M08E04"
##  [6] "WR9H12E04" "WR2L03E05" "WR2M06E05" "WR2H10E05" "WR1L02E07"
## [11] "WR1M07E07" "WR1H11E08"

It is useful to include the ^ metacharacter in case the same pattern also occurs somewhere other than the beginning of a string, as below.
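
A contrived example: "RF" appears at the start of one ID but in the middle of another.

ids <- c("RF3L02E03", "XXRF3L02E03")
str_subset(ids, "RF")  #matches both
## [1] "RF3L02E03"   "XXRF3L02E03"
str_subset(ids, "^RF") #matches only the ID that starts with RF
## [1] "RF3L02E03"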

Anchor Sequences

Anchor sequences in R, from Sanchez (2013)

Digits

digits <- c("Charlie", "Charlie2", "Mary", "Marianne", "Mary2", "15") 

#match a digit character 
str_subset(digits, "\\d") #return anything containing digits
## [1] "Charlie2" "Mary2"    "15"
#match a non-digit character
str_subset(digits, "\\D") 
## [1] "Charlie"  "Charlie2" "Mary"     "Marianne" "Mary2"
#return anything containing a non-digit; even one non-digit character is enough for the element to be returned
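
str_extract() returns the matched text itself rather than the whole element (NA where there is no match):

str_extract(digits, "\\d+")
## [1] NA   "2"  NA   NA   "2"  "15"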

Spaces

string <- c("School is fun. Especially recess. That's the best part. I love recess.")
#match a space character
str_replace_all(string, "\\s", "_") #replaced all spaces with "_"
## [1] "School_is_fun._Especially_recess._That's_the_best_part._I_love_recess."
#match a non space character
str_replace_all(string, "\\S", "_") #replaced all non space with "_"
## [1] "______ __ ____ __________ _______ ______ ___ ____ _____ _ ____ _______"

Words

#match a word character 
str_replace_all(string, "\\w", "z") #replace all characters within words with "z"
## [1] "zzzzzz zz zzz. zzzzzzzzzz zzzzzz. zzzz'z zzz zzzz zzzz. z zzzz zzzzzz."
#match a non word character 
str_replace_all(string, "\\W", "_") #replace all non word chracters with "_"
## [1] "School_is_fun__Especially_recess__That_s_the_best_part__I_love_recess_"
#this is useful for matching a specific word in str_subset 
str_subset(sentences, "red")[1:15]
##  [1] "The colt reared and threw the tall rider."         
##  [2] "The wide road shimmered in the hot sun."           
##  [3] "See the cat glaring at the scared mouse."          
##  [4] "He ordered peach pie with ice cream."              
##  [5] "Pure bred poodles have curls."                     
##  [6] "Mud was spattered on the front of his white shirt."
##  [7] "The sofa cushion is red and of light weight."      
##  [8] "Torn scraps littered the stone floor."             
##  [9] "The doctor cured him with these pills."            
## [10] "The new girl was fired today at noon."             
## [11] "The third act was dull and tired the players."     
## [12] "Lire wires should be kept covered."                
## [13] "It is hard to erase blue or red ink."              
## [14] "The wreck occurred by the bank on Main Street."    
## [15] "The box is held by a bright red snapper."
str_subset(sentences, "\\w red")
##  [1] "The sofa cushion is red and of light weight."  
##  [2] "It is hard to erase blue or red ink."          
##  [3] "The box is held by a bright red snapper."      
##  [4] "The houses are built of red clay bricks."      
##  [5] "The red tape bound the smuggled food."         
##  [6] "The lake sparkled in the red hot sun."         
##  [7] "Mark the spot with a sign painted red."        
##  [8] "The small red neon lamp went out."             
##  [9] "The sky in the west is tinged with orange red."
## [10] "The red paper brightened the dim stage."       
## [11] "The big red apple fell to the ground."

Text Processing

We will cover basic text processing using the tidytext package. However, many other packages exist, such as quanteda.

Examples are from Text Mining with R (Silge & Robinson, 2017).

Example with some text from Emily Dickinson.

text <- c("Because I could not stop for Death -",
          "He kindly stopped for me -",
          "The Carriage held but just Ourselves -",
          "and Immortality")

text
## [1] "Because I could not stop for Death -"  
## [2] "He kindly stopped for me -"            
## [3] "The Carriage held but just Ourselves -"
## [4] "and Immortality"
#and convert it into a dataframe
text_df <- data_frame(line = 1:4, text = text)

text_df
## # A tibble: 4 x 2
##    line text                                  
##   <int> <chr>                                 
## 1     1 Because I could not stop for Death -  
## 2     2 He kindly stopped for me -            
## 3     3 The Carriage held but just Ourselves -
## 4     4 and Immortality

The tidytext package follows tidy data principles:

  • Each variable is a column
  • Each observation is a row
  • Each type of observational unit is a table

This means the tidytext format is a table with one token per row. We need to tokenize the text (i.e., split it into individual words). This is done using unnest_tokens() in tidytext.

unnest_tokens(output, input, token = "words")

output: Output column to be created as string or symbol.

input: Input column that gets split as string or symbol.

token: Unit for tokenizing, or a custom tokenizing function. Built-in options are "words" (the default), "characters", "character_shingles", "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", and "regex".

text_df %>%
  unnest_tokens(word, text)
## # A tibble: 20 x 2
##     line word       
##    <int> <chr>      
##  1     1 because    
##  2     1 i          
##  3     1 could      
##  4     1 not        
##  5     1 stop       
##  6     1 for        
##  7     1 death      
##  8     2 he         
##  9     2 kindly     
## 10     2 stopped    
## 11     2 for        
## 12     2 me         
## 13     3 the        
## 14     3 carriage   
## 15     3 held       
## 16     3 but        
## 17     3 just       
## 18     3 ourselves  
## 19     4 and        
## 20     4 immortality

After running unnest_tokens:

  1. Punctuation has been stripped.

  2. By default, unnest_tokens() converts the tokens to lowercase, which makes them easier to compare or combine with other datasets. (Use the to_lower = FALSE argument to turn off this behavior).

Now the data is in a tidy format, which allows manipulation with tidyverse packages such as dplyr, tidyr, and ggplot2.

Tidying Jane Austen’s text

We are going to import Jane Austen’s 6 completed, published novels from the janeaustenr package (Silge 2016), and transform them into a tidy format. The janeaustenr package provides these texts in a one-row-per-line format, where a line in this context is analogous to a literal printed line in a physical book. Let’s start with that, and also use mutate() to annotate a linenumber quantity to keep track of lines in the original format and a chapter (using a regex) to find where all the chapters are.

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup()

original_books
## # A tibble: 73,422 x 4
##    text                  book                linenumber chapter
##    <chr>                 <fct>                    <int>   <int>
##  1 SENSE AND SENSIBILITY Sense & Sensibility          1       0
##  2 ""                    Sense & Sensibility          2       0
##  3 by Jane Austen        Sense & Sensibility          3       0
##  4 ""                    Sense & Sensibility          4       0
##  5 (1811)                Sense & Sensibility          5       0
##  6 ""                    Sense & Sensibility          6       0
##  7 ""                    Sense & Sensibility          7       0
##  8 ""                    Sense & Sensibility          8       0
##  9 ""                    Sense & Sensibility          9       0
## 10 CHAPTER 1             Sense & Sensibility         10       1
## # ... with 73,412 more rows

Now, we need to restructure it in the one-token-per-row format using the unnest_tokens() function.

tidy_books <- original_books %>%
  unnest_tokens(word, text)

tidy_books
## # A tibble: 725,055 x 4
##    book                linenumber chapter word       
##    <fct>                    <int>   <int> <chr>      
##  1 Sense & Sensibility          1       0 sense      
##  2 Sense & Sensibility          1       0 and        
##  3 Sense & Sensibility          1       0 sensibility
##  4 Sense & Sensibility          3       0 by         
##  5 Sense & Sensibility          3       0 jane       
##  6 Sense & Sensibility          3       0 austen     
##  7 Sense & Sensibility          5       0 1811       
##  8 Sense & Sensibility         10       1 chapter    
##  9 Sense & Sensibility         10       1 1          
## 10 Sense & Sensibility         13       1 the        
## # ... with 725,045 more rows

Removing stop words

Often in text analysis, we will want to remove stop words: words that are not useful for an analysis, typically extremely common words such as “the”, “of”, and “to” in English. We can remove stop words (kept in the tidytext dataset stop_words) with anti_join(), which returns all rows from x where there are no matching values in y, keeping just the columns from x.

You can also customize stop words (example further down)

#if stop words are not removed 
tidy_books %>%
  count(word, sort = TRUE) 
## # A tibble: 14,520 x 2
##    word      n
##    <chr> <int>
##  1 the   26351
##  2 to    24044
##  3 and   22515
##  4 of    21178
##  5 a     13408
##  6 her   13055
##  7 i     12006
##  8 in    11217
##  9 was   11204
## 10 it    10234
## # ... with 14,510 more rows
#the most common words are "the", "to", "and", etc.

#now remove the stop words 
data(stop_words)

tidy_books <- tidy_books %>%
  anti_join(stop_words)
## Joining, by = "word"

Counting and plotting

We can also use dplyr’s count() to find the most common words in all the books as a whole, and plot the result with ggplot2.

#count

tidy_books %>%
  count(word, sort = TRUE) 
## # A tibble: 13,914 x 2
##    word       n
##    <chr>  <int>
##  1 miss    1855
##  2 time    1337
##  3 fanny    862
##  4 dear     822
##  5 lady     817
##  6 sir      806
##  7 day      797
##  8 emma     787
##  9 sister   727
## 10 house    699
## # ... with 13,904 more rows
#ggplot

tidy_books %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

Sentiment Analysis

When human readers approach a text, we use our understanding of the emotional intent of words to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust. We can use the tools of text mining to approach the emotional content of text programmatically.

There are a variety of methods and dictionaries that exist for evaluating the opinion or emotion in text. The tidytext package contains several sentiment lexicons in the sentiments dataset.

sample_n(sentiments, size = 12)
## # A tibble: 12 x 4
##    word          sentiment lexicon  score
##    <chr>         <chr>     <chr>    <int>
##  1 invincibility positive  bing        NA
##  2 appreciating  <NA>      AFINN        2
##  3 fresh         <NA>      AFINN        1
##  4 recuse        litigious loughran    NA
##  5 refused       <NA>      AFINN       -2
##  6 bacteria      negative  nrc         NA
##  7 crack         negative  nrc         NA
##  8 congress      disgust   nrc         NA
##  9 successfully  positive  loughran    NA
## 10 fancier       positive  bing        NA
## 11 satirical     negative  bing        NA
## 12 nepotism      anger     nrc         NA

There are three general-purpose sentiment lexicons in sentiments (AFINN, bing, and nrc), each with its own properties; pick whichever suits your project’s needs.

The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.

To get specific sentiment lexicons, you can use get_sentiments().
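
For example (output omitted; bing is a tibble of words with positive/negative labels, AFINN a tibble of words with integer scores):

get_sentiments("bing")
get_sentiments("afinn")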

Example of Sentiment Analysis

We already imported Jane Austen’s books in the previous section. Let’s perform sentiment analysis on the most common “joy” and “anger” words in the book Emma using inner_join(), which returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.

#joy
nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 298 x 2
##    word          n
##    <chr>     <int>
##  1 friend      166
##  2 hope        143
##  3 happy       125
##  4 love        117
##  5 deal         92
##  6 found        92
##  7 happiness    76
##  8 pretty       68
##  9 true         66
## 10 comfort      65
## # ... with 288 more rows
#anger
nrc_anger <- get_sentiments("nrc") %>% 
  filter(sentiment == "anger")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_anger) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 313 x 2
##    word           n
##    <chr>      <int>
##  1 ill           72
##  2 bad           60
##  3 feeling       56
##  4 bear          52
##  5 words         49
##  6 obliging      34
##  7 evil          33
##  8 difficulty    30
##  9 spite         24
## 10 loss          23
## # ... with 303 more rows

Finding the most common positive and negative words

By counting word and sentiment together, we can find out how much each word contributes to each sentiment.

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts
## # A tibble: 2,555 x 3
##    word      sentiment     n
##    <chr>     <chr>     <int>
##  1 miss      negative   1855
##  2 happy     positive    534
##  3 love      positive    495
##  4 pleasure  positive    462
##  5 poor      negative    424
##  6 happiness positive    369
##  7 comfort   positive    292
##  8 doubt     negative    281
##  9 affection positive    272
## 10 perfectly positive    271
## # ... with 2,545 more rows
#Now lets plot it
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
## Selecting by n

Custom Stop-word

We can see an anomaly with the word “miss”: it is coded as negative, but in Jane Austen’s works it is used as a title for young, unmarried women. We can add “miss” to a custom stop-words list using bind_rows():

custom_stop_words <- bind_rows(data_frame(word = c("miss"), 
                                          lexicon = c("custom")), 
                               stop_words)

custom_stop_words
## # A tibble: 1,150 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 miss        custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # ... with 1,140 more rows

Wordclouds

Wordclouds can be made using the package wordcloud.

tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"

Most common positive and negative words in Jane Austen’s novels

#acast() is from the reshape2 package; comparison.cloud() is from wordcloud
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining, by = "word"

Sentences

Some sentiment analysis algorithms look beyond only unigrams (i.e. single words) to try to understand the sentiment of a sentence as a whole. These algorithms try to understand that

I am not having a good day.

is a sad sentence, not a happy one, because of negation.

R packages such as coreNLP (T. Arnold and Tilton 2016), cleanNLP (T. B. Arnold 2016), and sentimentr (Rinker 2017) are examples of such sentiment analysis algorithms. For these, we may want to tokenize text into sentences, and it makes sense to use a new name for the output column in such a case.
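
For example, the Dickinson text from earlier could be tokenized into sentences rather than words. A sketch (output omitted):

text_df %>%
  unnest_tokens(sentence, text, token = "sentences")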

Sentimentr package