Text Processing

Author

Izzy Crame

Let the Games Begin!

Text processing in R

We’ll need the following packages today:

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidytext)
library(ggplot2)
library(tidyr)
library(stringr)
library(gutenbergr)
library(babynames)
library(readr)
library(wordcloud)
Loading required package: RColorBrewer
library(stm)
stm v1.3.8 successfully loaded. See ?stm for help. 
 Papers, resources, and other materials at structuraltopicmodel.com
library(reshape2)

Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':

    smiths
library(textdata)

Text based data can be incredibly messy, you’ll have a lot of different issues that you may encounter. One of the better learning methods is to take this to the very minutia, and to use references frequently.

rob to me when i asked how i should tackle this:

String Manipulations

Strings can be anything wrapped in quotes or read in as text:

c("TRUE", "happy hunger games!", "District12", "74th", "katniss")
[1] "TRUE"                "happy hunger games!" "District12"         
[4] "74th"                "katniss"            

We’ll also be using stringr, which is a useful package for manipulating data. Here’s a link to the documentation: https://stringr.tidyverse.org/

In this package, all functions start with str_ and take a vector of strings as the first argument.

They also have built-in sample data we’ll be exploring in this tutorial:

tibble(babynames)
# A tibble: 1,924,665 × 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ℹ 1,924,655 more rows

Reading in Data

To explore these functions, we’ll use some sample data from The Hanging Tree, a song of rebellion from District 12.

lyrics <- c("Are you, are you",
          "Coming to the tree",
          "Where they strung up a man they say murdered three?",
          "Strange things did happen here",
          "No stranger would it be",
          "If we met up at midnight in the hanging tree.")

Tokenizing Text

# let's make the tibble of this text into a dataframe
lyrics_df <- tibble(line = 1:6, text = lyrics)

# now we'll peek and see what that dataframe looks like
lyrics_df
# A tibble: 6 × 2
   line text                                               
  <int> <chr>                                              
1     1 Are you, are you                                   
2     2 Coming to the tree                                 
3     3 Where they strung up a man they say murdered three?
4     4 Strange things did happen here                     
5     5 No stranger would it be                            
6     6 If we met up at midnight in the hanging tree.      

Next, we want to convert this into a tidy dataframe (one token per document per row)

# we'll use the unnest_tokens() function from the tidytext package to do this
lyrics_tokenized <- lyrics_df %>% 
  # this is similar to pivoting longer, where the first arg is the name of 
  # the new column and the second arg is where the values are coming from
  unnest_tokens(word, text)

# other columns remain, text is converted to lower(), and punctuation is
# stripped

Major Stringr Functions

These are reference charts for string manipulation functions in Base R and Stringr:

Stringr String Manipulations

Changing Case

As we saw a couple of weeks ago, we may want to clean up our data by changing the capitalization. You can go from your data to upper, lower, title, and snake case (although snake case may be less helpful for text processing)

# UPPER CASE
str_to_upper(lyrics_df$text) %>% head()
[1] "ARE YOU, ARE YOU"                                   
[2] "COMING TO THE TREE"                                 
[3] "WHERE THEY STRUNG UP A MAN THEY SAY MURDERED THREE?"
[4] "STRANGE THINGS DID HAPPEN HERE"                     
[5] "NO STRANGER WOULD IT BE"                            
[6] "IF WE MET UP AT MIDNIGHT IN THE HANGING TREE."      
# lower case
str_to_lower(lyrics_df$text) %>% head()
[1] "are you, are you"                                   
[2] "coming to the tree"                                 
[3] "where they strung up a man they say murdered three?"
[4] "strange things did happen here"                     
[5] "no stranger would it be"                            
[6] "if we met up at midnight in the hanging tree."      
# camelCase
str_to_camel(lyrics_df$text) %>% head()
[1] "areYouAreYou"                             
[2] "comingToTheTree"                          
[3] "whereTheyStrungUpAManTheySayMurderedThree"
[4] "strangeThingsDidHappenHere"               
[5] "noStrangerWouldItBe"                      
[6] "ifWeMetUpAtMidnightInTheHangingTree"      
# kebab-case
str_to_kebab(lyrics_df$text) %>% head()
[1] "are-you-are-you"                                   
[2] "coming-to-the-tree"                                
[3] "where-they-strung-up-a-man-they-say-murdered-three"
[4] "strange-things-did-happen-here"                    
[5] "no-stranger-would-it-be"                           
[6] "if-we-met-up-at-midnight-in-the-hanging-tree"      
# snake_case
str_to_snake(lyrics_df$text) %>% head()
[1] "are_you_are_you"                                   
[2] "coming_to_the_tree"                                
[3] "where_they_strung_up_a_man_they_say_murdered_three"
[4] "strange_things_did_happen_here"                    
[5] "no_stranger_would_it_be"                           
[6] "if_we_met_up_at_midnight_in_the_hanging_tree"      
# Title Case
str_to_title(lyrics_df$text) %>% head()
[1] "Are You, Are You"                                   
[2] "Coming To The Tree"                                 
[3] "Where They Strung Up A Man They Say Murdered Three?"
[4] "Strange Things Did Happen Here"                     
[5] "No Stranger Would It Be"                            
[6] "If We Met Up At Midnight In The Hanging Tree."      
# Sentence case
str_to_sentence(lyrics_df$text) %>% head()
[1] "Are you, are you"                                   
[2] "Coming to the tree"                                 
[3] "Where they strung up a man they say murdered three?"
[4] "Strange things did happen here"                     
[5] "No stranger would it be"                            
[6] "If we met up at midnight in the hanging tree."      

Joining Strings

# joins all of the entries together
str_c("Bread", "and", "Circuses")
[1] "BreadandCircuses"
# you can add a specified separator like a space
str_c("Bread", "and", "Circuses", sep = " ")
[1] "Bread and Circuses"
# or a symbol
str_c("Bread", "and", "Circuses", sep = "!")
[1] "Bread!and!Circuses"
# for example, csv's are comma separated vectors 
str_c("Bread", "and", "Circuses", sep = ", ")
[1] "Bread, and, Circuses"
# notice here your spacing matters and that separators only appear between entries and not
# after the last entry
str_c("Bread", "and", "Circuses", sep = " ! ")
[1] "Bread ! and ! Circuses"

String Length

Maybe you want to know how many characters are in each entry in the vector. str_length() will give you the length of each entry in it’s indexed position.

# see what the first 5 entries are in the fruit sample data
lyrics_tokenized$word[1:5]
[1] "are"    "you"    "are"    "you"    "coming"
# get length of each entry
str_length(lyrics_tokenized$word[1:5])
[1] 3 3 3 3 6

Modify Strings

Maybe you are a gamemaker trying to censor the lyrics for propoganda. You want to know where all the words that might upset the public are.

lyrics_df
# A tibble: 6 × 2
   line text                                               
  <int> <chr>                                              
1     1 Are you, are you                                   
2     2 Coming to the tree                                 
3     3 Where they strung up a man they say murdered three?
4     4 Strange things did happen here                     
5     5 No stranger would it be                            
6     6 If we met up at midnight in the hanging tree.      
lyrics_tokenized
# A tibble: 38 × 2
    line word  
   <int> <chr> 
 1     1 are   
 2     1 you   
 3     1 are   
 4     1 you   
 5     2 coming
 6     2 to    
 7     2 the   
 8     2 tree  
 9     3 where 
10     3 they  
# ℹ 28 more rows
 # strung, murdered, hanging 

str_sub lets you pick which character index you would like to select for. Or maybe, you want to replace a portion of your strings with particular characters.

replaced_words <- str_sub(lyrics_tokenized$word[c(11, 17, 37)]) 
str_sub(replaced_words,1, -3) <- "x"
replaced_words
[1] "xng" "xed" "xng"

You can also do this with str_replace_all()

replaced_words <- str_replace_all(lyrics_tokenized$word, 
                                  c("strung" = "sung",
                                    "up" = "of",
                                    "murdered" = "climbed",
                                    "hang" = "climb"))
replaced_words
 [1] "are"      "you"      "are"      "you"      "coming"   "to"      
 [7] "the"      "tree"     "where"    "they"     "sung"     "of"      
[13] "a"        "man"      "they"     "say"      "climbed"  "three"   
[19] "strange"  "things"   "did"      "happen"   "here"     "no"      
[25] "stranger" "would"    "it"       "be"       "if"       "we"      
[31] "met"      "of"       "at"       "midnight" "in"       "the"     
[37] "climbing" "tree"    

Subsetting Strings

str_subset() returns all elements of the vector where there is at least 1 match of the pattern you instruct.

# this looks for "str" in the tokenized lyrics
str_subset(lyrics_tokenized$word, "str")
[1] "strung"   "strange"  "stranger"

Counting Patterns in Strings

str_count() gives you how many times a pattern occurs in each entry.

# this finds the vowel count in each word
str_count(lyrics_tokenized$word, "[aeiou]")
 [1] 2 2 2 2 2 1 1 2 2 1 1 1 1 1 1 1 3 2 2 1 1 2 2 1 2 2 1 1 1 1 1 1 1 2 1 1 2 2

Detecting Patterns in Strings

# this detects whether or not the pattern "ee" is present in strings in
# the tokenized lyrics
str_detect(lyrics_tokenized$word, "ee")
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE  TRUE

Locating Patterns

str_locate(lyrics_df$text, "tree")
     start end
[1,]    NA  NA
[2,]    15  18
[3,]    NA  NA
[4,]    NA  NA
[5,]    NA  NA
[6,]    41  44

Dealing with White Space

You can remove white space using str_trim()

white_space <- c(" before", "after ", " both ")
# use str_trim() to take off white space before, after, and both
str_trim(white_space)
[1] "before" "after"  "both"  

You can pad with extra space using str_pad()

str_pad(lyrics_tokenized$word[1:8], 10)
[1] "       are" "       you" "       are" "       you" "    coming"
[6] "        to" "       the" "      tree"
str_pad(lyrics_tokenized$word[1:8], 10, side = "right")
[1] "are       " "you       " "are       " "you       " "coming    "
[6] "to        " "the       " "tree      "
str_pad(lyrics_tokenized$word[1:8], 10, side = "both")
[1] "   are    " "   you    " "   are    " "   you    " "  coming  "
[6] "    to    " "   the    " "   tree   "

You can also pad with something else

str_pad(lyrics_tokenized$word[1:8], 10, pad = "0", side = "right")
[1] "are0000000" "you0000000" "are0000000" "you0000000" "coming0000"
[6] "to00000000" "the0000000" "tree000000"

Remember Who The Real Enemy Is (it’s regex)

Regular Expressions

Regular expressions are a challenging part of text processing that you’ll likely have to reference every time. This section hopefully will function as a resource for that process.

Stringr Regular Expressions

str_view() returns a tibble showing the way the strings are delineated behind the scenes

str_view(lyrics_tokenized$word, "str")
[11] │ <str>ung
[19] │ <str>ange
[25] │ <str>anger

grep() is a baser function that searches for matches to a pattern within each element of a character vector. str_subset() is the stringr equivalent.

# lets see which lines use the word "the"
grep("the", lyrics_df$text)
[1] 2 3 6
# setting value = TRUE returns the full sentences
grep("the", lyrics_df$text, value = TRUE)
[1] "Coming to the tree"                                 
[2] "Where they strung up a man they say murdered three?"
[3] "If we met up at midnight in the hanging tree."      
# you can also get the full sentences returned using str_subset()
str_subset(lyrics_df$text, "the")
[1] "Coming to the tree"                                 
[2] "Where they strung up a man they say murdered three?"
[3] "If we met up at midnight in the hanging tree."      

You can also use str_detect() or grepl() to return a logical evaluation of whether a pattern exists

str_detect(lyrics_df$text, "the")
[1] FALSE  TRUE  TRUE FALSE FALSE  TRUE
# or in base r 
grepl("the", lyrics_df$text)
[1] FALSE  TRUE  TRUE FALSE FALSE  TRUE

Notice when we look at the full picture, we see that 2 of the sentences have the word “the” and one of them does not. Instead, it has a pattern for “the” in “they”.

str_view(lyrics_df$text, "the")
[2] │ Coming to <the> tree
[3] │ Where <the>y strung up a man <the>y say murdered three?
[6] │ If we met up at midnight in <the> hanging tree.

Note that for later.

ggplot Example

Since str_detect() outputs a logical vector of the same length as the input, it works well with filter(). We’ll use the sample data from babynames for this example in honor of Peeta “if it werent for the baby” Mellark.

# our goal here is to combine str_detect() and filter() to create a count tibble
babynames %>% 
  filter(str_detect(name, "y")) %>% 
  count(name, wt = n, sort = TRUE)
# A tibble: 23,371 × 2
   name          n
   <chr>     <int>
 1 Mary    4138360
 2 Anthony 1439784
 3 Dorothy 1110773
 4 Timothy 1072882
 5 Nancy   1004956
 6 Betty   1003093
 7 Jeffrey  977240
 8 Ryan     947787
 9 Gary     903245
10 Ashley   859453
# ℹ 23,361 more rows
# you can also turn this right into a plot!
babynames %>% 
  group_by(year) %>% 
  summarize(proportion_y_names = mean(str_detect(name, "y"))) %>% 
  ggplot(aes(x = year, y = proportion_y_names)) +
  geom_line()

You can also use str_count() with mutate()

babynames %>% 
  count(name) %>% 
  mutate(
     vowels = str_count(name, "[aeiou]"),
     consonants = str_count(name, "[^aeiou]")
  )
# A tibble: 97,310 × 4
   name          n vowels consonants
   <chr>     <int>  <int>      <int>
 1 Aaban        10      2          3
 2 Aabha         5      2          3
 3 Aabid         2      2          3
 4 Aabir         1      2          3
 5 Aabriella     5      4          5
 6 Aada          1      2          2
 7 Aadam        26      2          3
 8 Aadan        11      2          3
 9 Aadarsh      17      2          5
10 Aaden        18      2          3
# ℹ 97,300 more rows

Note here that str_count() is CASE SENSITIVE, so to capture all the occurrences, include both upper and lower case

babynames %>% 
  count(name) %>% 
  mutate( vowels = str_count(name, "[aeiouAEIOU]"), 
          consonants = str_count(name, "[^aeiouAEIOU]") 
  )
# A tibble: 97,310 × 4
   name          n vowels consonants
   <chr>     <int>  <int>      <int>
 1 Aaban        10      3          2
 2 Aabha         5      3          2
 3 Aabid         2      3          2
 4 Aabir         1      3          2
 5 Aabriella     5      5          4
 6 Aada          1      3          1
 7 Aadam        26      3          2
 8 Aadan        11      3          2
 9 Aadarsh      17      3          4
10 Aaden        18      3          2
# ℹ 97,300 more rows

or convert all your data to a particular case using the modifiers from above (like str_to_lower())

Anchor Sequences

These anchor sequences give you more specific ways to match regex

Digits

test <- c("TRUE", "happy hunger games!", "District12", "12", "74th", "katniss")
# this gives you the entries with digits
str_subset(test, "\\d") 
[1] "District12" "12"         "74th"      
# this gives you the entries with characters which are non-digits
str_subset(test, "\\D") 
[1] "TRUE"                "happy hunger games!" "District12"         
[4] "74th"                "katniss"            

Spaces

test_lines <- c(lyrics_df$text[1])
# this will replace all spaces in the character string with ~
str_replace_all(test_lines, "\\s", "~")
[1] "Are~you,~are~you"

Words

Word characters in regex include letters, numbers, and underscores. Non-word characters refer to punctuation, spaces, and symbols.

# lets replace all the words in this sentence with i
str_replace_all(test_lines, "\\w", "i")
[1] "iii iii, iii iii"
# actually, let's replace anything that's not a word with !
str_replace_all(test_lines, "\\W", "!")
[1] "Are!you!!are!you"
# now let's look for any sentence that has "the" in it
str_subset(lyrics_df$text, "the")
[1] "Coming to the tree"                                 
[2] "Where they strung up a man they say murdered three?"
[3] "If we met up at midnight in the hanging tree."      
# wait remember how earlier that gave us any sentence where "t" "h" and "e" occurred 
# together, I only want the ones that mention the WORD "the"

# let's use the boundary anchor to specify that we want a WORD
str_subset(lyrics_df$text, "\\bthe\\b")
[1] "Coming to the tree"                           
[2] "If we met up at midnight in the hanging tree."

Quantifiers

  • ‘?’ makes a pattern optional

  • ‘+’ lets a pattern repeat

  • ‘*’ lets a pattern be optional OR repeat

  • ‘.’ means any

  • ‘^’ means except

  • ‘|’ picks between one or more patterns

  • { , } gives boundaries on how many times to match a pattern (ex. {n} is exactly n times, {n, } is at least n times, and {n,m} is between n and m times

test_strings <- c("abby", "acab", "abcd", "abbb", "abac")
# this returns strings that contain "a" followed by any other character
str_view(test_strings, "a.")
[1] │ <ab>by
[2] │ <ac><ab>
[3] │ <ab>cd
[4] │ <ab>bb
[5] │ <ab><ac>
# this returns strings that contain "a' followed by any character and then "b"
str_view(test_strings, "a.b") 
[1] │ <abb>y
[4] │ <abb>b
# this returns strings that match an "a" then optionally have a "b"
str_view(test_strings, "ab?") 
[1] │ <ab>by
[2] │ <a>c<ab>
[3] │ <ab>cd
[4] │ <ab>bb
[5] │ <ab><a>c
# this returns strings that match an "a" then have at least one "b"
str_view(test_strings, "ab+") 
[1] │ <abb>y
[2] │ ac<ab>
[3] │ <ab>cd
[4] │ <abbb>
[5] │ <ab>ac
# this returns strings that match an "a" then any number of "b"
str_view(test_strings, "ab*") 
[1] │ <abb>y
[2] │ <a>c<ab>
[3] │ <ab>cd
[4] │ <abbb>
[5] │ <ab><a>c
# this returns strings that do not match "ab"
str_view(test_strings, "[^ab]")
[1] │ abb<y>
[2] │ a<c>ab
[3] │ ab<c><d>
[5] │ aba<c>
# this returns strings that contain either "str" or "ng"
str_view(lyrics_tokenized$word, "er|ing")
 [5] │ com<ing>
 [9] │ wh<er>e
[17] │ murd<er>ed
[20] │ th<ing>s
[23] │ h<er>e
[25] │ strang<er>
[37] │ hang<ing>

Text Processing

now the fun stuff

We’re going to read in the complete text of the Hunger Games by Suzanne Collins and clean it up using tokenization and getting rid of stop words, or words that we don’t want to influence our data analysis.

thg <- read_csv("https://www.dropbox.com/scl/fi/y693rxodocciymxl2sqg6/HungerGames.txt?rlkey=zm24c89lzpn9yeanst80w2h6f&e=1&st=mhmucvbc&dl=1")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 528 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): ref: refs/heads/masterhe Hunger Games

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# now lets add a column that tells us which part of the book each line happens
by_part <- thg %>% 
  # rename the column of text
  rename(text = `ref: refs/heads/masterhe Hunger Games`) %>% 
  # add a new column called 'part' that is a cummulative sum of each time we detect in the column 'text' "PART" followed
  # by any number of "I"s
  mutate(part = cumsum(str_detect(text, regex("PART I+"))),
                      # and another column that tells us what line in the original file this is
                       line_number = row_number())

# now we use unnext_tokens() to turn this into a long dataframe of every word in the whole book
tidy_thg <- by_part %>% 
  unnest_tokens(word, text)

# from the tidytext package, we have a built in stop_words dictionary. You can also build your own per your research
# or find one available online 
data("stop_words")

# use anti_join() to remove any rows where the text is a stop word and create a new dataframe called thg_words
thg_words <- tidy_thg %>% 
  anti_join(stop_words)
Joining with `by = join_by(word)`

Now, we have a dataframe that has all of the words we are interested in for text analysis. We can start answering questions with this data frame.

Let’s say we want to predict who Katniss is going to end up with based on how much she talks about Peeta or Gale. We might want to count how many times in the book she mentions each of their names.

# take the tidy dataframe of tokenized words
thg_words %>% 
  # count the occurences of each word and sort by frequency
  count(word, sort = T) %>% 
  # filter to only the words we're interested in (ones that contain
  # their names)
  filter(str_detect(word, "peeta|gale")) 
# A tibble: 4 × 2
  word       n
  <chr>  <int>
1 peeta    496
2 gale      92
3 peetas    82
4 gales     12

Looks like Peeta is living in her brain rent free.

Now let’s plot what the most used words in the whole book are using a similar process:

# take the tidy dataframe of tokenized words
thg_words %>% 
  # count the occurences of each word and sort by frequency
  count(word, sort = T) %>% 
  # filter to only the words used more than 100 times (this is an arbitrary
  # boundary, you can choose whatever fits your data best)
  filter(n > 100) %>%
  # create a tible of words ordered by frequency and their counts
  mutate(word = reorder(word, n)) %>% 
  # plot using both frequency and word
  ggplot(aes(n, word)) +
  # use column plot
  geom_col() + 
  # no labels
  labs(y = NULL) 

By FAR the word she says most in this book is Peeta.

Sentiment Analysis

# there are 3 built in dictionaries
# afinn has values
get_sentiments("afinn")
# A tibble: 2,477 × 2
   word       value
   <chr>      <dbl>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# ℹ 2,467 more rows
# bing has positive and negative
get_sentiments("bing")
# A tibble: 6,786 × 2
   word        sentiment
   <chr>       <chr>    
 1 2-faces     negative 
 2 abnormal    negative 
 3 abolish     negative 
 4 abominable  negative 
 5 abominably  negative 
 6 abominate   negative 
 7 abomination negative 
 8 abort       negative 
 9 aborted     negative 
10 aborts      negative 
# ℹ 6,776 more rows
# nrc has emotion words
get_sentiments("nrc")
# A tibble: 13,872 × 2
   word        sentiment
   <chr>       <chr>    
 1 abacus      trust    
 2 abandon     fear     
 3 abandon     negative 
 4 abandon     sadness  
 5 abandoned   anger    
 6 abandoned   fear     
 7 abandoned   negative 
 8 abandoned   sadness  
 9 abandonment anger    
10 abandonment fear     
# ℹ 13,862 more rows
# we'll use bing for our purposes today

Let’s create a sentiment analysis visualization of how positive or negative the words Katniss uses are through each of the 3 parts of the book.

# we're going to try to join our thg_word data frame with our sentiment evaluations
thg_sentiment <- thg_words %>% 
  # using inner_join() and get_sentiments() from the tidytext package, we'll pull the sentiments from the bing 
  # dictionary and align them with the words in the thg_word dataframe
  inner_join(get_sentiments("bing")) %>% 
  # then we're going to do a frequency count, retaining the part of the book it's from, chunked by lines of 20, 
  # retaining sentiment
  count(part, index = line_number%/% 20, sentiment)
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 33370 of `x` matches multiple rows in `y`.
ℹ Row 2957 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# now we want to pivot this wider for our visualization purposes so we have a row for each chunk of 20 lines
# and columns for which part of the book it's from, which chunk of 20 lines, how many positive and negative
# words were used, and a calculated column for the net sentiment (positive - negative)
thg_sentiment_wide <- thg_sentiment %>% 
  # get names for new columns from sentiment column
  pivot_wider( names_from = sentiment, 
               # get values from count column
               values_from = n) %>% 
  # make a new column that's n positive - n negative
  mutate(sentiment = positive - negative)

# now we plot this dataframe as a column and facet wrap by part of the book
thg_sentiment_wide %>% 
  ggplot(aes(index, sentiment, fill = part)) +
  geom_col(show.legend = F) +
  facet_wrap(~part, scales = "free_x")

Yikes… Katniss is having a baaaad time.

Next let’s try to see what words are telling us she’s having a bad time. We can use sentiment analysis in a slightly different way to identify which positive and negative words are used the most frequently.

# trying to get a dataframe of the counts of each word used and associate it with a sentiment from the bing dictionary
bing_word_counts <- thg_words %>% 
  # using inner_join() and get_sentiments() from the tidytext package, we'll pull the sentiments from the bing 
  # dictionary and align them with the words in the thg_word dataframe
  inner_join(get_sentiments("bing")) %>% 
  # then we're going to count word occurences by sentiment and sort them by frequency
  count(word, sentiment, sort = T) %>% 
  # and ungroup our data so it's nice at the end
  ungroup()
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 33370 of `x` matches multiple rows in `y`.
ℹ Row 2957 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# next we're going to turn this into a visualization
bing_word_counts %>% 
  # we want to group by positive and negative words
  group_by(sentiment) %>% 
  # take the top 10 of each group
  slice_max(n, n = 10) %>% 
  ungroup() %>% 
  # reorder the word column by frequency
  mutate(word = reorder(word, n)) %>% 
  # then begin plotting counts of words filled by their sentiment
  ggplot(aes(n, word, fill = sentiment))+
  # using geom_col()
  geom_col(show.legend = F) + 
  # and facet wrapping by sentiment so we can see different graphs for pos and neg
  facet_wrap(~sentiment, scales = "free_y")

oop there’s a problem – Let’s not Rue the day we didn’t check our data.

# you can add to the stop words dictionary to have extra words that you input based on your own data specifications
# and then run this process again with the anti_join() step using the new dictionary
custom_stop_words <- bind_rows(tibble(word = c("rue"), 
                                      lexicon = c("custom")),
                               stop_words)

Word Cloud

Another way to know what’s coming up a lot for Katniss through this book without the attached sentiment is a word cloud. This lets us see what are the most commonly used words.

thg_words %>% 
  # getting rid of words we don't want to pay attention to using the stop_words data
  anti_join(stop_words) %>% 
  # counding the occurences of each word
  count(word) %>% 
  # using the function wordcloud() from the package wordcloud we use the args what words we want
  # what the frequency is, and the maximum number of words we want included.
  with(wordcloud(word, n, max.words = 100))
Joining with `by = join_by(word)`

Topic Modeling

Now imagine you’re Caesar Flickerman and you’re covering the Hunger Games. You want to know what will get people talking! An important part of that, is knowing what topics to talk about. Topic modeling helps us get a lay of the land and what topics are coming up frequently in our corpus of text.

First, we need a tidy text table of the words and their frequency counts. To avoid our earlier issue, we’re going to add all the main names in the book to our stop_words data.

custom_stop_words <- bind_rows(tibble(word = c("rue", 
                                               "peeta", 
                                               "katniss", 
                                               "gale", 
                                               "prim", 
                                               "primrose",
                                               "haymitch",
                                               "effie",
                                               "cinna",
                                               "cato",
                                               "suzanne",
                                               "collins",
                                               "im",
                                               "hes",
                                               "clove",
                                               "ive",
                                               "hunger",
                                               "games"), 
                                      lexicon = c("custom")),
                               stop_words)
# there are fancier ways to do this too, you can use filter and str_detect

thg_no_names <- tidy_thg %>%
  anti_join(custom_stop_words) %>% 
  filter(!str_detect(word, "peeta|\\brue.\\b|katniss|\\bgale.\\b|haymitch|\\bcato.\\b"))
Joining with `by = join_by(word)`

Next we want to identify the highest tf-idf words in the story, or which words are the most important to a document in a collection of documents.

# make a new dataframe thg_tf_idf which takes thg_no_names and adds the words' tf_idf score and 
# parses each words importance to the story by part

thg_tf_idf <-  thg_no_names %>% 
  # count how many times in each part each word occurs and sort by frequency
  count(part, word, sort = T) %>% 
  # add the tf_idf rating
  bind_tf_idf(word, part, n) %>% 
  # arrange by tf_idf score
  arrange(-tf_idf) %>% 
  # group by part of the book
  group_by(part) %>% 
  # pick the top 10
  top_n(10) %>% 
  ungroup()
Selecting by tf_idf
# now we're going to plot this information in a facet wrap
thg_tf_idf %>% 
  # reorder the word column in ascending order of tf_idf ratings by part
  mutate(word = reorder_within(word, tf_idf, part)) %>% 
  # set up a ggplot with word on the x axis, tf_idf score on the y axis and fill by part
  ggplot(aes(word, tf_idf, fill = part)) +
  geom_col(show.legend = F) +
  # facet wrap the plotted columns by part
  facet_wrap(~part, scales = "free") +
  # make sure we're in the correct order of words
  scale_x_reordered() +
  # this might make more sense vertically
  coord_flip()

# messing around with this told me what to include in the filters and the stop words up above

This is basically just another frequency plot, so lets get into the actual topic modeling

# using tidytext, we need to make our tidy dataframe with no names into a sparse matrix
thg_sparse <- thg_no_names %>% 
  count(part, word, sort = T) %>% 
  cast_sparse(part, word, n)

# then using stm() we'll feed it the sparse matrix and specify that we want 6 topics
topic_model <- stm(thg_sparse, K = 6, 
                   verbose = F, init.type = "Spectral")

# summary() will work here for a basic review
summary(topic_model)
A topic model with 6 topics, 3 documents and a 7054 word dictionary.
Topic 1 Top Words:
     Highest Prob: district, time, boy, water, dont, tree, careers 
     FREX: careers, supplies, pack, tree, tracker, nest, boy 
     Lift: 100, 14, 15, 16, 17, 18, abandoning 
     Score: 100, pack, supplies, pyramid, tracker, nest, wasps 
Topic 2 Top Words:
     Highest Prob: district, dont, time, people, mother, capitol, tributes 
     FREX: trinket, reaping, table, called, father, people, mother 
     Lift: 13s, cookies, dandelion, friendly, honest, pig, skills 
     Score: 13s, trinket, reaping, tesserae, bakers, justice, son 
Topic 3 Top Words:
     Highest Prob: dont, time, hand, feel, ill, thresh, district 
     FREX: thresh, cave, stream, rocks, berries, goat, leg 
     Lift: 150, 19, 20, 21, 22, 23, 24 
     Score: 150, cave, stream, goat, kiss, mutts, mutt 
Topic 4 Top Words:
     Highest Prob: district, time, boy, water, dont, tree, careers 
     FREX: careers, supplies, pack, tree, tracker, nest, boy 
     Lift: zigzag, 14, 15, 16, 17, 18, abandoning 
     Score: zigzag, pack, supplies, pyramid, tracker, nest, wasps 
Topic 5 Top Words:
     Highest Prob: dont, time, hand, feel, ill, thresh, district 
     FREX: thresh, cave, stream, rocks, berries, goat, leg 
     Lift: 1s, 19, 20, 21, 22, 23, 24 
     Score: 1s, cave, stream, goat, kiss, mutts, mutt 
Topic 6 Top Words:
     Highest Prob: district, dont, time, people, capitol, mother, tributes 
     FREX: trinket, reaping, table, called, father, people, mother 
     Lift: 220, balcony, caesars, capes, community, delly, garden 
     Score: 220, trinket, reaping, tesserae, bakers, justice, son 
# but we can also turn this into a tidy format so we can do some dataviz with it
tidy_topic_df <- tidy(topic_model)

# now we make another ggplot col facet wrapped by part to tell us the words associated with 
# different topics
tidy_topic_df %>% 
  group_by(topic) %>% 
  # take the top 10 most important terms based on their beta scores
  top_n(10, beta) %>% 
  ungroup() %>% 
  # replace the current column topic with values where we have Topic and then the number 
  # of the topic from before
  mutate(topic = paste0("Topic ", topic),
         # also reorder the terms in the column term based on their beta score within each topic
         term = reorder_within(term, beta, topic)) %>% 
  # begin ggplot using term as x, beta score as y, and fill according to which topic
  ggplot(aes(term, beta, fill = as.factor(topic))) +
  # use geom col
  geom_col(show.legend = F) +
  # facet wrap by topic
  facet_wrap(~ topic, scales = "free_y") +
  # prettier when verticle
  coord_flip() +
  scale_x_reordered()

References

Here’s some links to a bunch of references that helped me construct this:
Topic Modeling

Tidy Text Mining

stringr

Regex

Strings

Happy Text Processing, and may the odds be ever in your favor!

Mini Hacks

Mini Hack 1 : When he sings even the birds stop to listen

Read in the entire Hanging Tree song and tokenize it. Then, identify all of the words that rhyme and subset them. In this case, rhyming means the rhyme scheme of the song (a, b, b, c, b, b). For example, rhyming does include both “tree” and “be”.

Mini Hack 2 : Destroying things is much easier than making them.

Create a wordcloud for each part of the book.

Mini Hack 3 : Here’s some advice. Stay alive.

Do anything we described above with a new corpus of text!

special bonus for getting to the end this is me in 2014 gearing up for mockingjay part 1