class: center, middle, inverse, title-slide

# CEMA 0907: Statistics in the Real World
## Extra Topics
### Anthony Scotina

---

# Needed Packages

```r
library(tidyverse)
library(tidytext) # Install this!
```

---
class: center, middle, frame

# Sentiment Analysis

## (Anthony's Version)

---

# Sentiment Analysis

Using *text mining techniques*, one piece of information we can extract from text is the **sentiment of the text**.

- Is the text positive or negative? Does it evoke *surprise* or *disgust*?

- If we *sum* the sentiment for *each word in a document*, does this reflect an accurate measure of sentiment for the document as a whole?

--

There are a variety of ways to judge the **sentiment** of a text.

- One such *sentiment lexicon* is the `bing` lexicon from [Bing Liu and collaborators](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html).

From `tidytext`:

```r
get_sentiments("bing")
```

---

# Taylor Swift

I would **Love** to tell a **Story** about the sentiment of Taylor Swift's songs.

From [TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-09-29/readme.md) via Rosie Baillie and Dr. Sara Stoudt:

```r
taylor_swift_lyrics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/taylor_swift_lyrics.csv')
```

```
## Rows: 132 Columns: 4
```

```
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Artist, Album, Title, Lyrics
```

```
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

---

# Tidy Text

Let's use `unnest_tokens()` to convert the data to *tidy* format!

- Each word will appear in its own row.
```r
tswift_tidy = taylor_swift_lyrics %>%
  unnest_tokens(output = word, input = Lyrics)
```

--

First, we'll want to *remove* **stop words**:

```r
head(stop_words) # 'stop_words' from 'tidytext' package
```

```
## # A tibble: 6 × 2
##   word      lexicon
##   <chr>     <chr>  
## 1 a         SMART  
## 2 a's       SMART  
## 3 able      SMART  
## 4 about     SMART  
## 5 above     SMART  
## 6 according SMART
```

---

# `anti_join()`

Here's where we can use `anti_join()`!

- *Remove* stop words from `tswift_tidy`.

```r
tswift_tidy_anti = tswift_tidy %>%
  anti_join(stop_words, by = "word")
```

---

# Which words does T. Swift say the most?

```r
tswift_tidy_anti %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 22) %>%
  ggplot(aes(x = fct_reorder(word, n), y = n)) +
  geom_col() +
  labs(x = "", y = "Count",
       title = "Which words does Taylor Swift LOVE the most?") +
  coord_flip() +
  theme_minimal()
```

![](12-Extra_Topics_files/figure-html/unnamed-chunk-7-1.png)

---

# Back to Sentiment Analysis

The text is in **tidy format** (one word per row), so we can proceed with the **sentiment analysis**.

Let's **join** the `bing` lexicon to `tswift_tidy_anti`.

- We'll use `inner_join()`, because we want only the words that appear in **both** the lyrics and the `bing` lexicon:

```r
song_sentiment = tswift_tidy_anti %>%
  inner_join(get_sentiments("bing"), by = "word")
```

---

# Taylor Swift Sentiment Analysis

Next, we'll `count()` up the `word` and `sentiment` combinations, extract the **top 10** for each sentiment, and pass them off to `ggplot()` and `geom_col()`.

- I used `scales = "free_y"` in `facet_wrap()` so that each facet contained a different set of words (positive vs negative sentiments).
```r
sentiment_counts = song_sentiment %>%
  count(word, sentiment, sort = TRUE)

sentiment_counts %>%
  group_by(sentiment) %>%
  slice_max(order_by = n, n = 10) %>%
  ggplot(aes(x = fct_reorder(word, n), y = n, fill = sentiment)) +
  geom_col() +
  facet_wrap( ~ sentiment, scales = "free_y") +
  labs(x = "", y = "Contribution to sentiment",
       title = "Sentiment Analysis of Taylor Swift Songs") +
  theme_bw() +
  theme(legend.position = "none") +
  coord_flip()
```

---

# Taylor Swift Sentiment Analysis

.center[
<img src="12-Extra_Topics_files/figure-html/unnamed-chunk-10-1.png" width="65%" />
]

---
class: center, middle

# One More Thing

## case_when()

---

# `case_when()`

Are there more **resort** hotel reservations (compared to **city** hotel reservations) during the *summer* months?

This dataset contains open data on **hotel booking demand** via [TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md).

```r
hotels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
```

--

**Two variables of interest**:

- `hotel`: `city` or `resort`

- `arrival_date_month`: the *month* (in words) of arrival

---

# `case_when()`

From `?case_when`:

> This function allows you to vectorise multiple `if_else()` statements.

--

Let's create a new `season` variable:

- If `arrival_date_month %in% c("December", "January", "February")`, then `season = "Winter"`

- If `arrival_date_month %in% c("March", "April", "May")`, then `season = "Spring"`

- If `arrival_date_month %in% c("June", "July", "August")`, then `season = "Summer"`

- If `arrival_date_month %in% c("September", "October", "November")`, then `season = "Fall"`

---

# `case_when()`

From `?case_when`:

> This function allows you to vectorise multiple `if_else()` statements.
The general syntax within a `case_when()` statement is:

.center[
case_when(`CONDITION1 ~ CATEGORY NAME IF CONDITION1 IS TRUE`, `CONDITION2 ~ CATEGORY NAME IF CONDITION2 IS TRUE`, ...)
]

```r
mutate(new_variable = case_when(`CONDITION1 ~ CATEGORY NAME IF CONDITION1 IS TRUE`,
                                `CONDITION2 ~ CATEGORY NAME IF CONDITION2 IS TRUE`,
                                ...
                                )
       )
```

---

# `case_when()`

From `?case_when`:

> This function allows you to vectorise multiple `if_else()` statements.

Let's create a new `season` variable:

```r
hotels = hotels %>%
  mutate(season = case_when(arrival_date_month %in% c("December", "January", "February") ~ "Winter",
                            arrival_date_month %in% c("March", "April", "May") ~ "Spring",
                            arrival_date_month %in% c("June", "July", "August") ~ "Summer",
                            arrival_date_month %in% c("September", "October", "November") ~ "Fall"))
```

---

# Hotel Reservations (by season)

Now let's make a bar graph using the new `season` variable!

```r
hotels %>%
  ggplot(aes(x = season, fill = hotel)) +
  geom_bar(position = "dodge") +
  labs(x = "", y = "", fill = "",
       title = "Hotel Reservations (by season)") +
  scale_y_continuous(labels = scales::comma) +
  coord_flip() +
  theme_minimal()
```
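
---

# `case_when()`: a quick check

The `season` recipe above can be tried on a tiny example before running it on the full `hotels` data. This is only a sketch: `toy_bookings` and its five rows are made up for illustration (they are *not* taken from the real dataset), and the final `TRUE ~ NA_character_` line is an optional addition that spells out what happens to a month that matches none of the conditions.

```r
library(dplyr)

# Hypothetical toy data (NOT the real `hotels` dataset).
# The last month is misspelled on purpose.
toy_bookings = tibble(
  hotel = c("city", "resort", "city", "resort", "city"),
  arrival_date_month = c("July", "July", "January", "October", "Juny")
)

toy_seasons = toy_bookings %>%
  mutate(season = case_when(
    arrival_date_month %in% c("December", "January", "February") ~ "Winter",
    arrival_date_month %in% c("March", "April", "May") ~ "Spring",
    arrival_date_month %in% c("June", "July", "August") ~ "Summer",
    arrival_date_month %in% c("September", "October", "November") ~ "Fall",
    TRUE ~ NA_character_  # explicit fallback: anything unmatched becomes NA
  ))

# Tabulate the same hotel/season combinations the bar graph displays
toy_seasons %>%
  count(hotel, season)
```

Conditions are checked in order, and each row gets the value from the *first* condition that is `TRUE`. A row that matches nothing (like the misspelled month here) ends up as `NA`, so a quick `count()` is a handy way to spot typos in the month names.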