Text Analysis

May 19, 2025

New package: tidytext

library(tidytext)

Survey data file

library(tidyverse)   # supplies read_csv(), mutate(), and the pipe

dat <- read_csv("data/dataAugust14.csv")
dat <- dat %>%
  mutate(ID = seq.int(nrow(dat)))   # add a sequential respondent ID
source("data/coding.R")   # recoding script (presumably defines rural, ed_binary, etc.)

Short Open-Ended Question


  • What does democracy mean to you? (V_Q6_1)

    • [i.e., how do people define or conceptualize democracy?]

Workflow


  • Read the responses!

  • Tokenize: “un-nest” the text responses so that each row of the data is a single word (or a pair or triplet of words: “N-grams”)

  • Remove “stop words” like “the,” “a,” etc.

  • Analyze/Visualize tokens

  • Code responses into groups if helpful and needed
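
Strung together, the steps compress to one short pipeline. A preview sketch (each step is unpacked on the slides that follow; stop_words ships with tidytext):

dat %>% 
  unnest_tokens(word, V_Q6_1) %>%   # tokenize
  anti_join(stop_words) %>%         # remove stop words
  count(word, sort = TRUE)          # analyze token frequencies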

Tokenization


A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of splitting text into tokens.

Unnest data into tokens

unNest <- dat %>% 
  unnest_tokens(word, V_Q6_1) %>%   # one row per word per response
  select(Respondent_Serial, word, rural, ed_binary)

Check out results

head(unNest, n = 10)
# A tibble: 10 × 4
   Respondent_Serial word        rural ed_binary          
               <dbl> <chr>       <dbl> <chr>              
 1            720044 it's            0 Secondary or higher
 2            720044 a               0 Secondary or higher
 3            720044 system          0 Secondary or higher
 4            720044 of              0 Secondary or higher
 5            720044 government      0 Secondary or higher
 6            720044 by              0 Secondary or higher
 7            720044 the             0 Secondary or higher
 8            720044 popuplation     0 Secondary or higher
 9            720044 people          0 Secondary or higher
10            719271 sharing         0 Secondary or higher

N-Grams

unNest_2 <- dat %>% 
  unnest_tokens(word, V_Q6_1, token = "ngrams", n = 2) %>%   # consecutive word pairs (bigrams)
  select(Respondent_Serial, word, rural)

Check out results

head(unNest_2, n = 10)
# A tibble: 10 × 3
   Respondent_Serial word               rural
               <dbl> <chr>              <dbl>
 1            720044 it's a                 0
 2            720044 a system               0
 3            720044 system of              0
 4            720044 of government          0
 5            720044 government by          0
 6            720044 by the                 0
 7            720044 the popuplation        0
 8            720044 popuplation people     0
 9            719271 sharing expressing     0
10            719271 expressing of          0


  • Can also un-nest into sentences if you have longer text; for example:
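
A minimal sketch, assuming the same dat and V_Q6_1 column (token = "sentences" is a built-in unnest_tokens option; unNest_sent is an illustrative name):

unNest_sent <- dat %>% 
  unnest_tokens(sentence, V_Q6_1, token = "sentences") %>%   # one row per sentence
  select(Respondent_Serial, sentence)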

Remove “stop words”

data(stop_words)   # stop word lexicon bundled with tidytext
unNest <- unNest %>% 
  anti_join(stop_words)   # drop rows whose word appears in the stop word list

Check out results

head(unNest, n = 10)
# A tibble: 10 × 4
   Respondent_Serial word        rural ed_binary          
               <dbl> <chr>       <dbl> <chr>              
 1            720044 system          0 Secondary or higher
 2            720044 government      0 Secondary or higher
 3            720044 popuplation     0 Secondary or higher
 4            720044 people          0 Secondary or higher
 5            719271 sharing         0 Secondary or higher
 6            719271 expressing      0 Secondary or higher
 7            719271 ideas           0 Secondary or higher
 8            719271 opinion         0 Secondary or higher
 9            719271 mind            0 Secondary or higher
10            719271 issue           0 Secondary or higher

Remove Stop Words from N-Grams

bigrams_separated <- unNest_2 %>%
  separate(word, c("word1", "word2"), sep = " ")   # split each bigram into two columns
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%   # drop bigrams whose first word is a stop word
  filter(!word2 %in% stop_words$word)       # ...or whose second word is
bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")    # paste the surviving pairs back together
head(bigrams_united)
# A tibble: 6 × 3
  Respondent_Serial bigram             rural
              <dbl> <chr>              <dbl>
1            720044 popuplation people     0
2            719271 sharing expressing     0
3            719271 ideas opinion          0
4            719271 issue subject          0
5            719271 country political      0
6            720153 sharing expressing     0

Top Words

Code to produce graph

unNest %>% 
  filter(word != "NA") %>%              # drop missing responses
  count(word, sort = TRUE) %>%
  filter(n > 20) %>%                    # keep words mentioned more than 20 times
  mutate(word = reorder(word, n)) %>%   # order bars by frequency
  ggplot(aes(n, word)) +
    geom_col(fill = "steelblue") +
    theme_bw() +
    labs(x = "Count",
         y = "Word",
         title = "What does democracy mean to you?")

Some words are not needed


  • Let’s remove “government,” “political,” “system,” and “country” (one way to do this is sketched below)
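
A sketch using a small custom stop word table (extra_stops and unNest_trimmed are illustrative names; the word cloud code below drops the same words with filter() instead):

extra_stops <- tibble(word = c("government", "political", "system", "country"))
unNest_trimmed <- unNest %>% 
  anti_join(extra_stops, by = "word")   # drop the four overly common words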

Top Bigrams

Code
bigrams_united %>% 
  filter(bigram != "NA") %>% 
  count(bigram, sort = TRUE) %>%
  filter(n > 5) %>%
  mutate(bigram = reorder(bigram, n)) %>%
  ggplot(aes(n, bigram)) +
    geom_col(fill = "steelblue") +
    theme_bw() +
    labs(x = "Count",
         y = "Bigrams",
         title = "What does democracy mean to you?")

Word clouds

Code
library(wordcloud)   # its RColorBrewer dependency supplies brewer.pal()
d <- unNest %>% 
  filter(word != "NA") %>% 
  filter(!word %in% c("government", "system", "political", "country")) %>%   # drop the overly common words
  count(word, sort = TRUE) %>% 
  ungroup()
wordcloud(d$word, d$n, random.order = FALSE, max.words = 50,
          colors = brewer.pal(8, "Dark2"))

Co-Occurrence in the same response

  • Can use n-grams when words are next to one another.

  • Can also capture when words co-occur in the same response, but not next to one another

library(widyr)
# count words co-occurring within the same response
word_pairs <- unNest %>%
  pairwise_count(word, Respondent_Serial, sort = TRUE)
words_united <- word_pairs %>%
  unite(pairs, item1, item2, sep = " ")   # label each pair with its two words
words_united
# A tibble: 402 × 2
   pairs                  n
   <chr>              <dbl>
 1 expressing sharing   459
 2 ideas sharing        459
 3 opinion sharing      459
 4 mind sharing         459
 5 issue sharing        459
 6 subject sharing      459
 7 country sharing      459
 8 political sharing    459
 9 sharing expressing   459
10 ideas expressing     459
# ℹ 392 more rows
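
widyr can also measure how strongly two words are correlated across responses, not just how often they co-occur. A sketch using pairwise_cor on the same unNest data (word_cors is an illustrative name; the n() filter keeps only reasonably common words):

word_cors <- unNest %>% 
  group_by(word) %>% 
  filter(n() >= 10) %>%                                # keep words appearing at least 10 times
  pairwise_cor(word, Respondent_Serial, sort = TRUE)   # phi coefficient for each word pair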

Quick break

Sentiment Analysis

tidytext provides access to several lexicons for evaluating the opinion or emotion in text. Three general-purpose lexicons are:

  • AFINN from Finn Årup Nielsen,
  • bing from Bing Liu and collaborators, and
  • nrc from Saif Mohammad and Peter Turney.

Sentiment Analysis

library(textdata)
get_sentiments("nrc")
# A tibble: 13,872 × 2
   word        sentiment
   <chr>       <chr>    
 1 abacus      trust    
 2 abandon     fear     
 3 abandon     negative 
 4 abandon     sadness  
 5 abandoned   anger    
 6 abandoned   fear     
 7 abandoned   negative 
 8 abandoned   sadness  
 9 abandonment anger    
10 abandonment fear     
# ℹ 13,862 more rows

Sentiment Analysis

get_sentiments("afinn")
# A tibble: 2,477 × 2
   word       value
   <chr>      <dbl>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# ℹ 2,467 more rows

Sentiment Analysis

get_sentiments("bing")
# A tibble: 6,786 × 2
   word        sentiment
   <chr>       <chr>    
 1 2-faces     negative 
 2 abnormal    negative 
 3 abolish     negative 
 4 abominable  negative 
 5 abominably  negative 
 6 abominate   negative 
 7 abomination negative 
 8 abort       negative 
 9 aborted     negative 
10 aborts      negative 
# ℹ 6,776 more rows

Connect lexicon to Ghana data

myB <- get_sentiments("bing")
ghanaB <- unNest %>% 
  inner_join(myB) # will merge on word
head(ghanaB)
# A tibble: 6 × 5
  Respondent_Serial word  rural ed_binary           sentiment
              <dbl> <chr> <dbl> <chr>               <chr>    
1            719271 issue     0 Secondary or higher negative 
2            720153 issue     0 Secondary or higher negative 
3            720750 issue     0 Primary or lower    negative 
4            720852 issue     0 Primary or lower    negative 
5            722984 issue     1 Secondary or higher negative 
6            722988 issue     1 Primary or lower    negative 

Common Words with Positive Sentiments

Common Words with Negative Sentiments
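
A sketch of code that would produce these charts, assuming the same bar chart pattern used elsewhere in the deck (the n > 10 threshold is illustrative; swap the sentiment filter to "negative" for the second chart):

ghanaB %>% 
  filter(sentiment == "positive") %>% 
  count(word, sort = TRUE) %>%
  filter(n > 10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
    geom_col(fill = "steelblue") +
    theme_bw() +
    labs(x = "Count",
         y = "Word",
         title = "Common Positive Words")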

New Example

The Office data

library(schrute)

The Office data

glimpse(theoffice)
Rows: 55,130
Columns: 12
$ index            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ season           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode_name     <chr> "Pilot", "Pilot", "Pilot", "Pilot", "Pilot", "Pilot",…
$ director         <chr> "Ken Kwapis", "Ken Kwapis", "Ken Kwapis", "Ken Kwapis…
$ writer           <chr> "Ricky Gervais;Stephen Merchant;Greg Daniels", "Ricky…
$ character        <chr> "Michael", "Jim", "Michael", "Jim", "Michael", "Micha…
$ text             <chr> "All right Jim. Your quarterlies look very good. How …
$ text_w_direction <chr> "All right Jim. Your quarterlies look very good. How …
$ imdb_rating      <dbl> 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6…
$ total_votes      <int> 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706,…
$ air_date         <chr> "2005-03-24", "2005-03-24", "2005-03-24", "2005-03-24…

Filter by character

scott <- theoffice %>% 
  filter(character == "Michael")

Tokenize

scott_unnest <- scott %>% 
  unnest_tokens(word, text) 

Remove Stop Words

data(stop_words)
scott_unnest <- scott_unnest %>% 
  anti_join(stop_words)

Top Words

Code
scott_unnest %>% 
  filter(word != "NA") %>% 
  count(word, sort = TRUE) %>%
  filter(n > 150) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
    geom_col(fill = "dodgerblue") +
    theme_bw() +
    labs(x = "Count",
         y = "Word",
         title = "Michael Scott's Top Words")

N-Grams (4 words)

unNest_2 <- scott %>% 
  unnest_tokens(word, text, token = "ngrams", n = 4)   # four-word phrases (4-grams)

N-Grams

Code
unNest_2 %>% 
  filter(word != "NA") %>% 
  count(word, sort = TRUE) %>%
  filter(n > 20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
    geom_col(fill = "steelblue") +
    theme_bw() +
    labs(x = "Count",
         y = "4-word phrase",
         title = "Michael Scott's Most Frequent Phrases",
         subtitle = "4-word N-grams")

Sentiment Analysis

myBing <- get_sentiments("bing")
scottBing <- scott_unnest %>% 
  inner_join(myBing) # will merge on word

Most Common Positive Words

Code
scottBing %>% 
  filter(word != "NA") %>% 
  filter(sentiment == "positive") %>%
  count(word, sort = TRUE) %>%
  filter(n > 25) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
    geom_col(fill = "dodgerblue") +
    theme_bw() +
    labs(x = "Count",
         y = "Word",
         title = "Michael Scott's Most Common Positive Words")

Most Common Negative Words

Code
scottBing %>% 
  filter(word != "NA") %>% 
  filter(sentiment == "negative") %>% 
  count(word, sort = TRUE) %>%
  filter(n > 25) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
    geom_col(fill = "darkred") +
    theme_bw() +
    labs(x = "Count",
         y = "Word",
         title = "Michael Scott's Most Common Negative Words")

Posit Cloud

IAFF 6501 Website
