CEMA 0907: Statistics in the Real World

class: center, middle, inverse, title-slide

# CEMA 0907: Statistics in the Real World
## Data Wrangling on Multiple Tables
### Anthony Scotina

---

# Needed Packages

```r
library(tidyverse)
library(nycflights13)
```

---

class: center, middle, frame

# Combining Multiple Tables

---

# Combining Multiple Tables

So far we have been working with *single data frames*, which are reminiscent of single *spreadsheets/tables*.

- However, many data projects contain *multiple tables* used together.

- Luckily, `dplyr` gives efficient means to work with multiple tables.

One family of verbs that we will use to work with multiple tables is called **mutating joins**.

- These add new variables to one data frame from *matching observations* in another

(i) `inner_join()`

(ii) `full_join()`

(iii) `left_join()` and `right_join()`
    
---

# `nycflights13`

The `nycflights13` package contains *five data frames*, though we've mainly worked with two. 
- `flights`, `weather`, `airlines`, `airports`, `planes`

From *R for Data Science*:

.center[
<img src="relational_nycflights.png" width="380" />
]

---

# `inner_join()`

Let's append the **destination** airport names via the `airports` table (run `View(airports)`).

The **matching key** between the `flights` and `airports` tables are:

- `dest` (from `flights`)
- `faa` (from `airports`)

```r
flightsJoinedDest = flights %>%
  inner_join(airports, by = c("dest" = "faa"))

nrow(flights)
```

```
## [1] 336776
```

```r
nrow(flightsJoinedDest)
```

```
## [1] 329174
```

But... `$329174\neq336776$`! This means that there were *some* destination airports in `flights` that didn't actually appear in `airports`.

---

# `left_join()`

We use `left_join()` when we want to include **all** rows in one table (the "left" table), and *only* rows in the other table with a matching "key" variable.

- `right_join()` works similarly, just in the other direction.

```r
flightsJoinedLeft = flights %>%
  left_join(airports, by = c("dest" = "faa"))

nrow(flights)
```

```
## [1] 336776
```

```r
nrow(flightsJoinedLeft)
```

```
## [1] 336776
```

If a value of `dest` **does not** appear in `airports`, the `airports` columns will appear as `NA` in the left-joined table.

---

# A Closer Inspection

We can use the `setdiff()` function to see which elements are in one vector, but not the other.

```r
setdiff(flights$dest, airports$faa)
```

```
## [1] "BQN" "SJU" "STT" "PSE"
```

**From Google**:

`BQN`: Rafael Hernandez Airport (in Puerto Rico)

`SJU`: Luis Muñoz Marín International Airport (in Puerto Rico)

`STT`: Cyril E. King Airport (in the U.S. Virgin Islands)

`PSE`: Mercedita/Ponce International Airport (in Puerto Rico)

This means that any flights recorded in the `flights` table that had destinations (i.e., `dest`) to any of these airports cannot be joined with the `airports` table.

- Because these `dest` airports are not included in the `airports` table!

---

# `left_join()`

When using `left_join()` between `flights` and `airports`, the four airports *that don't appear in* `airports` **will still be included in the joined table**.

- Their `tzone` (and all other values in `airports`) will be `NA`.

```r
flights %>%
  left_join(airports, by = c("dest" = "faa")) %>%
  filter(dest %in% c("LAX", "BQN", "SJU", "STT", "PSE")) %>%
  select(dest, dep_delay, name, lat, lon, tzone) %>%
  head(5)
```

```
## # A tibble: 5 × 6
##   dest  dep_delay name               lat   lon tzone              
##   <chr>     <dbl> <chr>            <dbl> <dbl> <chr>              
## 1 BQN          -1 <NA>              NA     NA  <NA>               
## 2 LAX          -2 Los Angeles Intl  33.9 -118. America/Los_Angeles
## 3 SJU           0 <NA>              NA     NA  <NA>               
## 4 SJU          -2 <NA>              NA     NA  <NA>               
## 5 LAX          -2 Los Angeles Intl  33.9 -118. America/Los_Angeles
```

---

# `right_join()`

We use `right_join()` when we want to include **all** rows in one table (the "right" table), and *only* rows in the other table with a matching "key" variable.

- `left_join()` works similarly, just in the other direction.

```r
flightsJoinedRight = flights %>%
  right_join(airports, by = c("dest" = "faa"))

nrow(flights)
```

```
## [1] 336776
```

```r
nrow(flightsJoinedRight)
```

```
## [1] 330531
```

```r
setdiff(airports$faa, flights$dest) # Too many!
```

---

# `full_join()`

`full_join()` works similarly to `inner_join()`, it just doesn't remove *any* rows.

- If any rows from either table **DO NOT** share a matching key, those will still be included with `NA` values under the columns from the *other* table.

```r
flightsJoinedFull = flights %>%
  full_join(airports, by = c("dest" = "faa"))

nrow(flights)
```

```
## [1] 336776
```

```r
nrow(flightsJoinedFull)
```

```
## [1] 338133
```

This included *all* of the airports that did not appear in `flights`.

- Since these airports were *not* destinations from NYC in 2013, their flight information includes `NA` values.

---

# Putting it all together

**Let's compute the number of flights from NYC to each destination, but also include information about each destination airport (i.e., the full name)**

```r
dest_counts = flights %>%
  group_by(dest) %>%
  summarize(num_flights = n()) %>%
  arrange(desc(num_flights)) %>%
  left_join(airports, by = c("dest" = "faa")) %>%
  rename(airport_name = name) 
```

---

# Putting it all together

**Let's compute the number of flights from NYC to each destination, but also include information about each destination airport (i.e., the full name)**

1. `summarize()` calculated the *number* of flights to *each destination* (using `group_by()`)

2. `arrange()` sorted flight counts in *descending order* based on most "popular" destination

3. `left_join()` linked the airport `name` in the `airports` data frame to the summarized data frame based on the *keys* `dest` and `faa`. 
    - `NA` values were returned if a `dest` was not in `airports`

4. `rename()` renamed the `name` column to `airport_name` for clarity.