CEMA 0907: Statistics in the Real World

# CEMA 0907: Statistics in the Real World
## Tidy Data
### Anthony Scotina

---

# Tidy Data

---

# Needed Packages

```r
library(tidyverse)
library(fivethirtyeight)
library(nycflights13)

spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
```

(We might use the [spotify_songs](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md) dataset in a bit!)

---

# `bush_approval` data

Download the `bush_approval` data from Canvas, and import it into R.

The `bush_approval` data frame consists of approval ratings (out of 100%) of former US President George W. Bush.

- `date`: date of poll
- `Approve`: overall percent who approve of President Bush
- `Disapprove`: overall percent who disapprove of President Bush
- `Republican`: percent approval among Republican voters
- `Independent`: percent approval among Independent voters
- `Democrat`: percent approval among Democrat voters
- `Pres.Party`: President Bush's political party

Simplify the data a bit:

```r
bush_approval = bush_approval %>%
  select(date, Republican, Independent, Democrat)
```

---

# Wide Data

```r
head(bush_approval, 5)
```

```
## # A tibble: 5 × 4
##   date       Republican Independent Democrat
##   <date>          <dbl>       <dbl>    <dbl>
## 1 2009-01-13         75           6       28
## 2 2008-12-16         67           7       25
## 3 2008-12-09         72           9       26
## 4 2008-11-18         66           6       28
## 5 2008-11-11         63           5       21
```

The `bush_approval` data are currently in **wide format**.

- Each party's approval of Bush is located in a separate column.

If *visual inspection* is the primary analytical technique, this format can be convenient.

- However, it can be more difficult to implement anything other than *visual inspection*.

---

# Tidy Data

As is, the `bush_approval` data are not **tidy**!

From [Wickham (2014)](https://vita.had.co.nz/papers/tidy-data.pdf)
  
1. Each variable forms a column.

2. Each observation forms a row.

3. Each type of observational unit forms a table.

---

# Tidy Data

Why is `bush_approval` *not* tidy?

```r
head(bush_approval, 5)
```

```
# A tibble: 5 × 4
  date       Republican Independent Democrat
  <date>          <dbl>       <dbl>    <dbl>
1 2009-01-13         75           6       28
2 2008-12-16         67           7       25
3 2008-12-09         72           9       26
4 2008-11-18         66           6       28
5 2008-11-11         63           5       21
```

What are the **three unique pieces of information** in this data frame?

1. Date
2. Approval Rating
3. Political Party

But... there are more than three columns!

---

# Tidy Data

We want to end up with this:

```r
bush_tidy
```

```
# A tibble: 849 × 3
   date       Party      Approval
   <date>     <chr>         <dbl>
 1 2009-01-13 Republican       75
 2 2008-12-16 Republican       67
 3 2008-12-09 Republican       72
 4 2008-11-18 Republican       66
 5 2008-11-11 Republican       63
 6 2008-11-04 Republican       61
 7 2008-10-28 Republican       63
 8 2008-10-14 Republican       57
 9 2008-10-07 Republican       55
10 2008-09-29 Republican       64
# … with 839 more rows
```

---

# How do we convert to "tidy" data?

---

# `pivot_longer()`

We convert a data frame to "tidy" (long format) by using the `pivot_longer()` function from the `tidyr` package.

- The `tidyr` package loads with the `tidyverse`.

```r
head(bush_approval, 5)
```

```r
bush_tidy = bush_approval %>%
  pivot_longer(names_to = "Party", 
               values_to = "Approval", 
               cols = -date)
```

---

# `pivot_longer()`

```r
bush_tidy = bush_approval %>%
  pivot_longer(names_to = "Party", 
               values_to = "Approval", 
               cols = -date)
bush_tidy
```

```
# A tibble: 849 × 3
   date       Party       Approval
   <date>     <chr>          <dbl>
 1 2009-01-13 Republican        75
 2 2009-01-13 Independent        6
 3 2009-01-13 Democrat          28
 4 2008-12-16 Republican        67
 5 2008-12-16 Independent        7
 6 2008-12-16 Democrat          25
 7 2008-12-09 Republican        72
 8 2008-12-09 Independent        9
 9 2008-12-09 Democrat          26
10 2008-11-18 Republican        66
# … with 839 more rows
```

---

# `pivot_longer()`

The `pivot_longer()` function takes the following *arguments*:

- `names_to` is the name of the column/variable in the new "tidy" frame that contains the column names of the original data frame that you want to tidy.
    - When we set `names_to = Party`, the `Party` column in `bush_tidy` contains the original Party names from `bush_approval`.

- `values_to` is the name of the column/variable in the "tidy" frame that contains the rows and columns of values in the original data frame you want to tidy. 
    - When we set `values_to = Approval`, the `Approval` column in `bush_tidy` contains all of the approval ratings from `bush_approval`.

- The third argument, `cols`, is the column(s) you either want to or don’t want to tidy. 
    - The `date` variable was already tidy, so we didn't need to further modify it. 
    
---

# Benefits of Tidy Data

While we sacrificed our ability to see the data *all at once*, it is much more straightforward to use *data visualization* and *data wrangling* techniques with **tidy data**.

```r
ggplot(data = bush_tidy, aes(x = date, y = Approval, color = Party)) + 
  geom_line(size = 1.5) + 
  labs(x = "", title = "Approval Rate for George W. Bush", subtitle = "by party")
```

---

# Going from tidy to wide format

If you have a *tidy* dataset but want to transform it to *wide format*, use the `pivot_wider()` function:

```r
bush_wide = bush_tidy %>%
  pivot_wider(names_from = "Party", 
              values_from = "Approval")
head(bush_wide, 5)
```

The `pivot_wider()` syntax is nearly *identical* to `pivot_longer()`, except with the `names_from` and `values_from` arguments.

---

# Data Cleaning: Tips and Tricks

## (Anthony's Favorite Things)

---

# Counting Observations per group

```r
video_games <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-07-30/video_games.csv")
```

- `number`:	Game number
- `game`: Game Title
- `release_date`:	Release date
- `price`: US Dollars + Cents
- `owners`: Estimated number of people owning this game
- `developer`: Group that developed the game
- `publisher`: Group that published the game
- `average_playtime`: Average playtime in minutes
- `median_playtime`: Median playtime in minutes
- `metascore`: Metascore rating

**Question**: How would you count the number of games *per publisher*, and then *arrange* by count?

---

# Counting Observations per group

Using some of `dplyr`'s data wrangling functions, we can do this in several lines:

```r
video_games %>%
  group_by(publisher) %>%
  summarize(N_games = n()) %>%
  arrange(desc(N_games))
```

This pipeline:

1. **groups by** `publisher`,

2. **summarizes** the data by simply counting the *number of rows* (`n()`) per publisher, and

3. **arranges in descending order** by the newly-created `N_games`.

---

# `count()`

The `count()` function is one of my **favorites**!

- `count()` simply *count* the number of observations **by group**.

```r
video_games %>%
  count(publisher)
```

- This returns a table with the groups and the number of observations per group.

---

# `count()`

If you want to `arrange` by `n` (the number of observations per group), you *could* use arrange...

```r
video_games %>%
  count(publisher) %>%
  arrange(desc(n))
```

---

# `count()`

You *could also* just use the `sort = TRUE` option in `count()` to sort in **descending order**:

```r
video_games %>%
  count(publisher, sort = TRUE)
```

```
## # A tibble: 13,954 × 2
##    publisher               n
##    <chr>               <int>
##  1 Big Fish Games        284
##  2 SEGA                  141
##  3 Strategy First        129
##  4 Ubisoft               123
##  5 Square Enix            95
##  6 <NA>                   95
##  7 Sekai Project          94
##  8 Choice of Games        92
##  9 Dagestan Technology    89
## 10 1C Entertainment       85
## # … with 13,944 more rows
```

---

# Lumping Factors

<strike>Let's make a **barplot** of *games per publisher*:</strike>

**Caution**: Don't actually make a barplot. There are 13,954 *unique publishers* (as seen in the previous `count()` table).

```r
ggplot(video_games, aes(x = publisher)) + 
  geom_bar() + 
  coord_flip()
```

---

# `fct_lump()`

The `fct_lump()` function is one of my **favorites**!

- `fct_lump()` is a `forcats` package function (part of the `tidyverse`) that **lumps together** the least common *factor* levels into a level named `"Other"`.

```r
video_games %>%
* mutate(publisher_lump = fct_lump(publisher, n = 5)) %>%
  count(publisher_lump, sort = TRUE)
```

```
## # A tibble: 7 × 2
##   publisher_lump     n
##   <fct>          <int>
## 1 Other          25821
## 2 Big Fish Games   284
## 3 SEGA             141
## 4 Strategy First   129
## 5 Ubisoft          123
## 6 Square Enix       95
## 7 <NA>              95
```

---

# `fct_lump()`

**Now** we can make a barplot without R crashing!

```r
video_games %>%
  mutate(publisher_lump = fct_lump(publisher, n = 5)) %>% 
  ggplot(aes(x = publisher_lump)) + 
  geom_bar() + 
  labs(x = "Publisher", y = "Count") + 
  coord_flip()
```

---

# `fct_reorder()` with `geom_col()`

Let's stay with `forcats` for a bit.

It is usually preferable to arrange a *barplot* (or *boxplots*/*stripplots*) so that the bars are in **ascending** or **descending order** based on the *y*-axis variable.

- Using `fct_reorder()` within `ggplot()` (with `geom_col()`) can get us there!

```r
video_games %>%
  mutate(publisher_lump = fct_lump(publisher, n = 5)) %>% 
  filter(publisher_lump != "Other") %>%
  count(publisher_lump) %>% 
  ggplot(aes(x = fct_reorder(publisher_lump, n), y = n)) + 
  geom_col() + 
  labs(x = "Publisher", y = "Count") + 
  coord_flip()
```

---

# `fct_reorder()` with `geom_col()`

You can also use `fct_reorder()` with a *summary statistic*, so that the bars in `geom_col()` are arranged based on the statistic (instead of *Count*):

```r
video_games %>%
  mutate(publisher_lump = fct_lump(publisher, n = 5)) %>% 
  filter(publisher_lump != "Other") %>%
* group_by(publisher_lump) %>%
* summarize(mean_price = mean(price, na.rm = TRUE)) %>%
  ggplot(aes(x = fct_reorder(publisher_lump, mean_price), y = mean_price)) + 
  geom_col() + 
  labs(x = "Publisher", y = "Average Price (in $, by publisher)") + 
  coord_flip()
```

---

# `separate()`

`separate()` is a `tidyr` function that can be used if you convert a *character* column into multiple other character columns.

We'll use the `nyc_restaurants` data from *TidyTuesday* (more on this dataset [HERE](https://github.com/rfordatascience/tidytuesday/tree/master/data/2018/2018-12-11)):

```r
nyc_restaurants <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-12-11/nyc_restaurants.csv")
```

```r
View(nyc_restaurants)
```

One variable, `inspection_type`, is defined as follows:

> A combination of the inspection program and the type of inspection performed; See Data Dictionary for full list of expected values

---

# `separate()`

There are **a lot** of different `inspection_type` categories:

```r
nyc_restaurants %>%
  count(inspection_type, sort = TRUE) 
```

```
## # A tibble: 33 × 2
##    inspection_type                                        n
##    <chr>                                              <int>
##  1 Cycle Inspection / Initial Inspection             171657
##  2 Cycle Inspection / Re-inspection                   73207
##  3 Pre-permit (Operational) / Initial Inspection      20896
##  4 Pre-permit (Operational) / Re-inspection            9260
##  5 Administrative Miscellaneous / Initial Inspection   5513
##  6 Cycle Inspection / Reopening Inspection             3426
##  7 Pre-permit (Non-operational) / Initial Inspection   2936
##  8 Smoke-Free Air Act / Initial Inspection             2530
##  9 Administrative Miscellaneous / Re-inspection        1765
## 10 Trans Fat / Initial Inspection                      1747
## # … with 23 more rows
```

---

# `separate()`

This means that making a useful visualization with all of these categories could be challenging!

```r
nyc_restaurants %>%
  count(inspection_type, sort = TRUE) %>%
  ggplot(aes(x = fct_reorder(inspection_type, n), y = n)) + 
  geom_col() + 
  labs(x = "", y = "Count") +
  coord_flip() 
```

---

# `separate()`

Let's take `inspection_type` and `separate()` it into two columns: `inspection_program` and `inspection_type`:

```r
nyc_restaurants_sep = nyc_restaurants %>%
  separate(inspection_type, 
           into = c("inspection_program", "inspection_type"), 
           sep = "/")
```

**Notes**

- The `into` argument gives the names of the **new columns** into which we are *separating* the original column.

- The `sep` argument is the character that serves as the *separator* between columns. 
    - In the original dataset, the `inspection_type` variable includes the different pieces of information separated by a `"/"`.
    
---

# `separate()`

Another way to **reorder** categories by frequency is by `fct_infreq()`. 
- With this we'll use `geom_bar()` instead of `geom_col()`, since we aren't first creating a *summary table*.

```r
nyc_restaurants_sep %>%
  filter(!is.na(inspection_program)) %>%
* ggplot(aes(x = fct_infreq(inspection_program))) +
  geom_bar(aes(fill = inspection_type), color = "black") + 
  labs(x = "", y = "Count", fill = "Inspection Type") + 
  coord_flip() + 
  theme_bw()
```

---

# `scales::comma`

Many visualizations will have an axis that uses a *continuous numerical* scale. By default, R does not add commas to large numbers.

- We can fix this by using `label = scales::comma` within `scale_x_continuous()` or `scale_y_continuous()`:

---

# Filtering Using Text Expressions

The `violation_description` column contains a **plain text** description of each unique violation.

- Let's see which **boroughs** have restaurants with the most *roach-related* violations!

We can use one of many `stringr` package (part of the `tidyverse`) functions to extract information about a particular *text pattern* from a character vector.

In particular, `str_detect()` detects the *presence* of a pattern within a string of text.

```r
?str_detect
```

---

# Filtering Using Text Expressions

We will use `str_detect()` in combination with `mutate()` to create a `logical` variable for whether or not each restaurant had a roach-related violation:

- `str_detect()` returns `TRUE` or `FALSE` depending on whether the given `pattern` appears within the `string`.

```r
nyc_restaurants = nyc_restaurants %>%
  mutate(roach_violation =
    str_detect(string = violation_description, 
               pattern = "roach"))
```

---

# Roach Violations in NY

```r
nyc_restaurants %>%
  filter(roach_violation == TRUE) %>%
  ggplot(aes(x = fct_infreq(boro))) + 
  geom_bar() + 
  labs(x = "", y = "Count", 
       title = "Which NYC Boroughs have the most Roach-related Violations?") +
  coord_flip() + 
  theme_bw() 
```

---

# `case_when()`

Are there more **resort** hotel reservations (compared to **city** hotel reservations) during the *summer* months?

This dataset contains open data on **hotel booking demand** via [TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md).

```r
hotels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
```

**Two variables of interest**:

- `hotel`: `city` or `resort`
- `arrival_date_month`: the *month* (in words) of arrival

---

# `case_when()`

From `?case_when`:

> This function allows you to vectorise multiple `if_else()` statements.

Let's create a new `season` variable:

- If `arrival_date_month %in% c("December", "January", "February")`, then `season = "Winter"`

- If `arrival_date_month %in% c("March", "April", "May")`, then `season = "Spring"`

- If `arrival_date_month %in% c("June", "July", "August")`, then `season = "Summer"`

- If `arrival_date_month %in% c("September", "October", "November")`, then `season = "Fall"`

---

# `case_when()`

From `?case_when`:

> This function allows you to vectorise multiple `if_else()` statements.

The general syntax within a `case_when()` statement is:

.center[
case_when(`CONDITION1 ~ CATEGORY NAME IF CONDITION1 IS TRUE`, 
`CONDITION1 ~ CATEGORY NAME IF CONDITION1 IS TRUE`
]

```r
mutate(new_variable = 
         case_when(`CONDITION1 ~ CATEGORY NAME IF CONDITION1 IS TRUE`, 
                   `CONDITION2 ~ CATEGORY NAME IF CONDITION2 IS TRUE`, 
                   ... )
       )
```

---

# `case_when()`

From `?case_when`:

> This function allows you to vectorise multiple `if_else()` statements.

Let's create a new `season` variable:

```r
hotels = hotels %>%
  mutate(season = 
           case_when(arrival_date_month %in% c("December", "January", "February") ~ "Winter", 
                     arrival_date_month %in% c("March", "April", "May") ~ "Spring", 
                     arrival_date_month %in% c("June", "July", "August") ~ "Summer", 
                     arrival_date_month %in% c("September", "October", "November") ~ "Fall"))
```

---

# Hotel Reservations (by season)

Now let's make a bar graph using the new `season` variable!

```r
hotels %>%
  ggplot(aes(x = season, fill = hotel)) + 
  geom_bar(position = "dodge") + 
  labs(x = "", y = "", fill = "",
       title = "Hotel Reservations (by season)") + 
  scale_y_continuous(labels = scales::comma) +
  coord_flip() 
```