CEMA 0907: Statistics in the Real World

# CEMA 0907: Statistics in the Real World
## Data Wrangling
### Anthony Scotina

---

# What have we done so far?

**Data Visualization**

```r
flights_frontier = flights %>%
  filter(carrier == "F9")

ggplot(flights_frontier, aes(x = dep_delay, y = arr_delay)) + 
  geom_point(alpha = 0.2) + 
  labs(x = "Departure Delay (in minutes)", 
       y = "Arrival Delay (in minutes)", 
       title = "Flights from NYC via Frontier Airlines", 
       subtitle = "Year: 2013") + 
  theme_minimal()
```

---

# What have we done so far?

**Data Visualization**

---

# What will we do now?

**Data Wrangling**

```r
*flights_frontier = flights %>%
* filter(carrier == "F9")

ggplot(flights_frontier, aes(x = dep_delay, y = arr_delay)) + 
  geom_point(alpha = 0.2) + 
  labs(x = "Departure Delay (in minutes)", 
       y = "Arrival Delay (in minutes)") + 
  theme_minimal()
```

---

# Data Wrangling in R?

**dplyr** (D-plyer, **NOT** *dipler*)

---

# Needed Packages

```r
library(tidyverse) # Automatically loads ggplot2 and dplyr
library(nycflights13)
library(gapminder)
```

---

# The Pipe Operator `%>%`

---

# The Pipe Operator `%>%`

Say you want to perform a set of operations on the `flights` data frame:

1. Take `flights`, then...

2. `filter` for *only* Jet Blue flights (carrier code "B6"), then...

3. `group_by` departure airport (`origin`), then...

4. `summarize` the *average flight distance* (`distance`) by departure airport.

---

# The Pipe Operator `%>%`

Say you want to perform a set of operations on the `flights` data frame:

Use the **pipe operator** (`%>%`), and write a *pipeline*:

```r
flights %>%
  filter(carrier == "B6") %>%
  group_by(origin) %>%
  summarize(mean_distance = mean(distance))
```

```
# A tibble: 3 × 2
  origin mean_distance
  <chr>          <dbl>
1 EWR             815.
2 JFK            1114.
3 LGA            1030.
```

---

# The Pipe Operator `%>%`

You would read this sequence as:

1. Take `flights`, then...

2. Use this output as the input to the next function, `filter`, then...

3. Use this output as the input to the next function, `group_by`, then...

4. Use this output as the input to the next function, `summarize`.

**The starting value will always be the data frame!**

- The subsequent arguments say *what to do* to the data frame.

- The result will the transformed/modified data frame that you want. 
    
---

# The Pipe Operator `%>%`

In `dplyr`, the pipe operator (`%>%`) plays a very similar role to the `+` operator when adding layers in `ggplot2`.

You can even save "piped" operations as a new data frame. In fact, we have already done it!

**Example** (from earlier): November weather at the JFK airport in New York:

```r
jfk_nov = weather %>%
  filter(origin == "JFK" & month == 11)
```

---

# A Grammar for Data Wrangling

---

# A Grammar for Data Wrangling

Similar to how `ggplot2` provides a *grammar* for **data visualization**...

- `dplyr` provides a *grammar* for **data transformation**.

**Five verbs for working with data in a data frame**:

- `select()`: take a subset of *columns* (e.g., variables)

- `filter()`: take a subset of *rows* (e.g., observations, cases, individuals)

- `mutate()`: add or modify *existing columns*

- `arrange()`: sort rows

- `summarize()`: aggregate data *across rows* (e.g., grouped means)

Each of these functions takes a *data frame* as its first argument (similar to `ggplot()`).

- They can also be used in conjunction with each other (e.g., in a *pipeline*).

---

# `select()` columns

`select()` allows you to return a *subset* of **columns** of a data frame.

- This might not seem like much, but data frames with *many* columns (i.e., hundreds of variables) are not uncommon.

---

# `select()` columns

The `gapminder` data frame in the `gapminder` package contains data on life expectancy, GDP per capita, and population by country, spanning 1952-2007 (in increments of 5 years).

```r
View(gapminder)
```

Suppose we want a data frame with *only* the `country`, `year`, and `lifeExp` variables (columns).

```r
gapminder_new = gapminder %>%
  select(country, year, lifeExp)
```

---

# `filter()` rows

While `select()` is useful in cutting out unwanted columns (especially if there are *many* columns), we might still need to perform other operations on the data.

`filter()` allows you to return a *subset* of **rows** of a data frame.

- Specify criteria about the values of a variables in your dataset.

- `filter()` out only those rows that match that criteria.

---

# `filter()` rows

Using the same `gapminder` data from earlier, let's obtain data for *only the United States*:

```r
gapminder_new = gapminder %>%
  filter(country == "United States")
```

**Reminder**: While `=` (and `<-`) are used for **assignment**, a double `==` is used to *test for equality*.

---

# Filtering on multiple criteria

We are not limited to a single criterion when using `filter()`.

- We can string together multiple criteria with `&` ("and").

```r
gapminder_new = gapminder %>%
  filter(country %in% c("United States", "Canada") &
           year >= 1980)
```

The result of this pipeline is a data frame that includes data on *only* the countries **United States** *or* **Canada**, in **1980 or later**.

**Note**: We use the `%in%` to check for rows with `country` equal to *either* `"United States"` **or** `"Canada"`.

---

# `arrange()` rows

The `arrange()` function will *sort the rows* of a data frame by a pre-specified *column*.

While we can do this as an *exploratory* step in the `View()` window, `arrange()` allows us to perform additional operations on a custom-sorted data frame.

---

# `arrange()` rows

```r
gapminder %>%
  arrange(year, country)
```

```
## # A tibble: 1,704 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Albania     Europe     1952    55.2  1282697     1601.
##  3 Algeria     Africa     1952    43.1  9279525     2449.
##  4 Angola      Africa     1952    30.0  4232095     3521.
##  5 Argentina   Americas   1952    62.5 17876956     5911.
##  6 Australia   Oceania    1952    69.1  8691212    10040.
##  7 Austria     Europe     1952    66.8  6927772     6137.
##  8 Bahrain     Asia       1952    50.9   120447     9867.
##  9 Bangladesh  Asia       1952    37.5 46886859      684.
## 10 Belgium     Europe     1952    68    8730405     8343.
## # … with 1,694 more rows
```

---

# `arrange()` rows (in descending order)

By default, `arrange()` will sort rows in **ascending order** (least to greatest, or *alphabetically*) based on the column(s) specified.

- To sort in **descending order**, use `desc()` around any of the variables specified in `arrange()`.

Let's take *only data from 2007* and sort by the *largest to smallest* life expectancy:

```r
gapminder %>%
  filter(year == 2007) %>%
  arrange(desc(lifeExp)) 
```

---

# `mutate()` existing columns

`mutate()` is a function that defines and inserts *new variables* into a data frame as *columns*.

---

# `mutate()`

In the `weather` data frame from the `nycflights13` package, the variable `temp` is temperature in **degrees Fahrenheit**. To convert this to **degrees Celcius**, we use the formula:

`$$\text{temp in C}=\frac{\text{temp in F}-32}{1.8}$$`

Using `mutate`, we can create the variable `temp_C` using the existing `temp`:

```r
weather = weather %>%
  mutate(temp_C = (temp - 32)/1.8)
View(weather)
```

---

# `mutate()`

```r
weather = weather %>%
  mutate(temp_C = (temp - 32)/1.8)
```

Note that by calling this new data frame `weather`, we *overwrite* the existing one that comes with the `nycflights13` package.

- **Why?** We are not *losing* any information; only *adding* new information.

- The original `temp` variable is still part of the data frame.

---

# `summarize()` with `group_by()`

The *verb-named functions* that we have looked at so far have all provided different ways to **manipulate** a data frame.

- `select()`, `filter()`, `mutate()`, `rename()`, `arrange()`

- The extent of the analysis we can perform *solely by manipulating a data frame* is limited.

`summarize()` and `group_by()` allow us to make calculations and *comparisons*.

- `summarize()` (with or without `group_by()`) collapses the data into rows based on **summary statistics**.

---

# `summarize()`

The important step in `summarize()` is specifying **how** we want to reduce the data to a single set of summary statistics. 
- **Examples**: `mean()`, `sd()`, `median()`, `IQR()`, `max()`, `n()`, and many more

Using the `weather` data frame, let's calculate the **mean** and **standard deviation** of `temp_F`:

```r
weather %>%
  summarize(mean_temp_F = mean(temp, na.rm = TRUE), 
            sd_temp_F = sd(temp, na.rm = TRUE))
```

```
## # A tibble: 1 × 2
##   mean_temp_F sd_temp_F
##         <dbl>     <dbl>
## 1        55.3      17.8
```

**Note**: We added `na.rm = TRUE` to remove `NA` (missing values) from each calculation. 
- Otherwise, the output for `mean()` and `sd()` would also be `NA`.

---

# `summarize()`

Given the nature of the `weather` data frame, why might the summary statistics in the previous slide contain *redundant information*?

```r
weather %>%
  count(origin)
```

```
## # A tibble: 3 × 2
##   origin     n
##   <chr>  <int>
## 1 EWR     8703
## 2 JFK     8706
## 3 LGA     8706
```

Temperature was recorded each hour in 2013 at *three different NYC airports*.

- Presumably, the temperature will be similar at each of the three airports at any given hour.

**Also**, there are definitely *seasonal/monthly* trends in temperature, especially in a state like New York.

---

# `group_by()` rows

This is where `group_by()` comes in!

The `summarize()` function is *almost always* used in conjunction with `group_by()`.

We just used `summarize()` to calculate the **mean** and **standard deviation** of New York temperature in 2013.

- But what if we want to calculate these summary statistics *for each month*?

- We can use `group_by()` in the pipeline to generate *aggregate* summary statistics!

---

# `group_by()`

Before running the code below, *what do you think it does*?

```r
weather_temp_month = weather %>%
  filter(origin == "JFK") %>% # Just focus on one NYC airport
  group_by(month) %>%
  summarize(mean_temp_F = mean(temp, na.rm = TRUE), 
            sd_temp_F = sd(temp, na.rm = TRUE))
weather_temp_month
```

Because **temperature** usually varies by *season*, or *month*, it makes sense to calculate monthly summary statistics.

- We could've also grouped by `origin` (i.e., `group_by(month, origin)`), but the monthly average temperature is probably very similar at the three airports.

---

# `group_by()`

The code on the previous slide is *identical* to the code we used when introducing `summarize()`...

- Except we added the first step `group_by(month)`, which does exactly what it sounds like: it **groups by** *month* before calculating summary statistics.

**Note**: We never use the `group_by()` operation by itself. We always combine it with a function that performs a calculation, such as `summarize()`.

For example, run the following. You'll notice that nothing really changes from the initial data frame.

```r
weather %>%
  group_by(month)
```

---

# Recap: The S.S. FAM

- `summarize()` (usually with `group_by()`): reduce variables to summary values

- `filter()`: pick a subset of rows matching criteria

- `arrange()`: reorder rows

- `mutate()`: add new variables
]

There are a *handful* of other functions we'll cover, but you can do a lot with these five!