class: center, middle, inverse, title-slide # CEMA 0907: Statistics in the Real World ## Data Wrangling on Multiple Tables ### Anthony Scotina --- # Needed Packages ```r library(tidyverse) library(nycflights13) ``` --- class: center, middle, frame # Combining Multiple Tables --- # Combining Multiple Tables So far we have been working with *single data frames*, which are reminiscent of single *spreadsheets/tables*. - However, many data projects contain *multiple tables* used together. - Luckily, `dplyr` gives efficient means to work with multiple tables. -- One family of verbs that we will use to work with multiple tables is called **mutating joins**. - These add new variables to one data frame from *matching observations* in another (i) `inner_join()` (ii) `full_join()` (iii) `left_join()` and `right_join()` --- # `nycflights13` The `nycflights13` package contains *five data frames*, though we've mainly worked with two. - `flights`, `weather`, `airlines`, `airports`, `planes` -- From *R for Data Science*: .center[ <img src="relational_nycflights.png" width="380" /> ] --- # `inner_join()` Let's append the **destination** airport names via the `airports` table (run `View(airports)`). The **matching key** between the `flights` and `airports` tables are: - `dest` (from `flights`) - `faa` (from `airports`) -- ```r flightsJoinedDest = flights %>% inner_join(airports, by = c("dest" = "faa")) nrow(flights) ``` ``` ## [1] 336776 ``` ```r nrow(flightsJoinedDest) ``` ``` ## [1] 329174 ``` -- But... `\(329174\neq336776\)`! This means that there were *some* destination airports in `flights` that didn't actually appear in `airports`. --- # `left_join()` We use `left_join()` when we want to include **all** rows in one table (the "left" table), and *only* rows in the other table with a matching "key" variable. - `right_join()` works similarly, just in the other direction. ```r flightsJoinedLeft = flights %>% left_join(airports, by = c("dest" = "faa")) nrow(flights) ``` ``` ## [1] 336776 ``` ```r nrow(flightsJoinedLeft) ``` ``` ## [1] 336776 ``` -- If a value of `dest` **does not** appear in `airports`, the `airports` columns will appear as `NA` in the left-joined table. --- # A Closer Inspection We can use the `setdiff()` function to see which elements are in one vector, but not the other. ```r setdiff(flights$dest, airports$faa) ``` ``` ## [1] "BQN" "SJU" "STT" "PSE" ``` -- **From Google**: `BQN`: Rafael Hernandez Airport (in Puerto Rico) `SJU`: Luis Muñoz Marín International Airport (in Puerto Rico) `STT`: Cyril E. King Airport (in the U.S. Virgin Islands) `PSE`: Mercedita/Ponce International Airport (in Puerto Rico) -- This means that any flights recorded in the `flights` table that had destinations (i.e., `dest`) to any of these airports cannot be joined with the `airports` table. - Because these `dest` airports are not included in the `airports` table! --- # `left_join()` When using `left_join()` between `flights` and `airports`, the four airports *that don't appear in* `airports` **will still be included in the joined table**. - Their `tzone` (and all other values in `airports`) will be `NA`. ```r flights %>% left_join(airports, by = c("dest" = "faa")) %>% filter(dest %in% c("LAX", "BQN", "SJU", "STT", "PSE")) %>% select(dest, dep_delay, name, lat, lon, tzone) %>% head(5) ``` ``` ## # A tibble: 5 × 6 ## dest dep_delay name lat lon tzone ## <chr> <dbl> <chr> <dbl> <dbl> <chr> ## 1 BQN -1 <NA> NA NA <NA> ## 2 LAX -2 Los Angeles Intl 33.9 -118. America/Los_Angeles ## 3 SJU 0 <NA> NA NA <NA> ## 4 SJU -2 <NA> NA NA <NA> ## 5 LAX -2 Los Angeles Intl 33.9 -118. America/Los_Angeles ``` --- # `right_join()` We use `right_join()` when we want to include **all** rows in one table (the "right" table), and *only* rows in the other table with a matching "key" variable. - `left_join()` works similarly, just in the other direction. ```r flightsJoinedRight = flights %>% right_join(airports, by = c("dest" = "faa")) nrow(flights) ``` ``` ## [1] 336776 ``` ```r nrow(flightsJoinedRight) ``` ``` ## [1] 330531 ``` ```r setdiff(airports$faa, flights$dest) # Too many! ``` --- # `full_join()` `full_join()` works similarly to `inner_join()`, it just doesn't remove *any* rows. - If any rows from either table **DO NOT** share a matching key, those will still be included with `NA` values under the columns from the *other* table. ```r flightsJoinedFull = flights %>% full_join(airports, by = c("dest" = "faa")) nrow(flights) ``` ``` ## [1] 336776 ``` ```r nrow(flightsJoinedFull) ``` ``` ## [1] 338133 ``` -- This included *all* of the airports that did not appear in `flights`. - Since these airports were *not* destinations from NYC in 2013, their flight information includes `NA` values. --- # Putting it all together **Let's compute the number of flights from NYC to each destination, but also include information about each destination airport (i.e., the full name)** -- ```r dest_counts = flights %>% group_by(dest) %>% summarize(num_flights = n()) %>% arrange(desc(num_flights)) %>% left_join(airports, by = c("dest" = "faa")) %>% rename(airport_name = name) ``` --- # Putting it all together **Let's compute the number of flights from NYC to each destination, but also include information about each destination airport (i.e., the full name)** 1. `summarize()` calculated the *number* of flights to *each destination* (using `group_by()`) 2. `arrange()` sorted flight counts in *descending order* based on most "popular" destination 3. `left_join()` linked the airport `name` in the `airports` data frame to the summarized data frame based on the *keys* `dest` and `faa`. - `NA` values were returned if a `dest` was not in `airports` 4. `rename()` renamed the `name` column to `airport_name` for clarity.