class: center, middle, inverse, title-slide # CEMA 0907: Statistics in the Real World ## Data Wrangling ### Anthony Scotina --- # What have we done so far? **Data Visualization** ```r flights_frontier = flights %>% filter(carrier == "F9") ggplot(flights_frontier, aes(x = dep_delay, y = arr_delay)) + geom_point(alpha = 0.2) + labs(x = "Departure Delay (in minutes)", y = "Arrival Delay (in minutes)", title = "Flights from NYC via Frontier Airlines", subtitle = "Year: 2013") + theme_minimal() ``` --- # What have we done so far? **Data Visualization** <img src="03-Data_Wrangling_files/figure-html/unnamed-chunk-3-1.png" width="60%" /> --- # What will we do now? **Data Wrangling** ```r *flights_frontier = flights %>% * filter(carrier == "F9") ggplot(flights_frontier, aes(x = dep_delay, y = arr_delay)) + geom_point(alpha = 0.2) + labs(x = "Departure Delay (in minutes)", y = "Arrival Delay (in minutes)") + theme_minimal() ``` --- # Data Wrangling in R? **dplyr** (D-plyer, **NOT** *dipler*) .center[ <img src="dplyr_hex.png" width="40%" /> ] --- # Needed Packages ```r library(tidyverse) # Automatically loads ggplot2 and dplyr library(nycflights13) library(gapminder) ``` --- class: middle, center # The Pipe Operator `%>%` --- # The Pipe Operator `%>%` Say you want to perform a set of operations on the `flights` data frame: 1. Take `flights`, then... -- 2. `filter` for *only* Jet Blue flights (carrier code "B6"), then... -- 3. `group_by` departure airport (`origin`), then... -- 4. `summarize` the *average flight distance* (`distance`) by departure airport. --- # The Pipe Operator `%>%` Say you want to perform a set of operations on the `flights` data frame: Use the **pipe operator** (`%>%`), and write a *pipeline*: ```r flights %>% filter(carrier == "B6") %>% group_by(origin) %>% summarize(mean_distance = mean(distance)) ``` ``` # A tibble: 3 × 2 origin mean_distance <chr> <dbl> 1 EWR 815. 2 JFK 1114. 3 LGA 1030. ``` --- # The Pipe Operator `%>%` You would read this sequence as: 1. Take `flights`, then... 2. Use this output as the input to the next function, `filter`, then... 3. Use this output as the input to the next function, `group_by`, then... 4. Use this output as the input to the next function, `summarize`. -- **The starting value will always be the data frame!** - The subsequent arguments say *what to do* to the data frame. - The result will the transformed/modified data frame that you want. --- # The Pipe Operator `%>%` In `dplyr`, the pipe operator (`%>%`) plays a very similar role to the `+` operator when adding layers in `ggplot2`. You can even save "piped" operations as a new data frame. In fact, we have already done it! **Example** (from earlier): November weather at the JFK airport in New York: ```r jfk_nov = weather %>% filter(origin == "JFK" & month == 11) ``` --- class: middle, center # A Grammar for Data Wrangling --- # A Grammar for Data Wrangling Similar to how `ggplot2` provides a *grammar* for **data visualization**... - `dplyr` provides a *grammar* for **data transformation**. -- **Five verbs for working with data in a data frame**: - `select()`: take a subset of *columns* (e.g., variables) - `filter()`: take a subset of *rows* (e.g., observations, cases, individuals) - `mutate()`: add or modify *existing columns* - `arrange()`: sort rows - `summarize()`: aggregate data *across rows* (e.g., grouped means) Each of these functions takes a *data frame* as its first argument (similar to `ggplot()`). - They can also be used in conjunction with each other (e.g., in a *pipeline*). --- # `select()` columns .center[ <img src="dplyr_select.png" width="359" /> ] `select()` allows you to return a *subset* of **columns** of a data frame. - This might not seem like much, but data frames with *many* columns (i.e., hundreds of variables) are not uncommon. --- # `select()` columns The `gapminder` data frame in the `gapminder` package contains data on life expectancy, GDP per capita, and population by country, spanning 1952-2007 (in increments of 5 years). ```r View(gapminder) ``` -- Suppose we want a data frame with *only* the `country`, `year`, and `lifeExp` variables (columns). ```r gapminder_new = gapminder %>% select(country, year, lifeExp) ``` --- # `filter()` rows While `select()` is useful in cutting out unwanted columns (especially if there are *many* columns), we might still need to perform other operations on the data. .center[ <img src="dplyr_filter.png" width="412" /> ] `filter()` allows you to return a *subset* of **rows** of a data frame. - Specify criteria about the values of a variables in your dataset. - `filter()` out only those rows that match that criteria. --- # `filter()` rows Using the same `gapminder` data from earlier, let's obtain data for *only the United States*: ```r gapminder_new = gapminder %>% filter(country == "United States") ``` **Reminder**: While `=` (and `<-`) are used for **assignment**, a double `==` is used to *test for equality*. --- # Filtering on multiple criteria We are not limited to a single criterion when using `filter()`. - We can string together multiple criteria with `&` ("and"). ```r gapminder_new = gapminder %>% filter(country %in% c("United States", "Canada") & year >= 1980) ``` The result of this pipeline is a data frame that includes data on *only* the countries **United States** *or* **Canada**, in **1980 or later**. -- **Note**: We use the `%in%` to check for rows with `country` equal to *either* `"United States"` **or** `"Canada"`. --- # `arrange()` rows The `arrange()` function will *sort the rows* of a data frame by a pre-specified *column*. .center[ <img src="dplyr_arrange.png" width="228" /> ] While we can do this as an *exploratory* step in the `View()` window, `arrange()` allows us to perform additional operations on a custom-sorted data frame. --- # `arrange()` rows ```r gapminder %>% arrange(year, country) ``` ``` ## # A tibble: 1,704 × 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 1952 28.8 8425333 779. ## 2 Albania Europe 1952 55.2 1282697 1601. ## 3 Algeria Africa 1952 43.1 9279525 2449. ## 4 Angola Africa 1952 30.0 4232095 3521. ## 5 Argentina Americas 1952 62.5 17876956 5911. ## 6 Australia Oceania 1952 69.1 8691212 10040. ## 7 Austria Europe 1952 66.8 6927772 6137. ## 8 Bahrain Asia 1952 50.9 120447 9867. ## 9 Bangladesh Asia 1952 37.5 46886859 684. ## 10 Belgium Europe 1952 68 8730405 8343. ## # … with 1,694 more rows ``` --- # `arrange()` rows (in descending order) By default, `arrange()` will sort rows in **ascending order** (least to greatest, or *alphabetically*) based on the column(s) specified. - To sort in **descending order**, use `desc()` around any of the variables specified in `arrange()`. -- Let's take *only data from 2007* and sort by the *largest to smallest* life expectancy: ```r gapminder %>% filter(year == 2007) %>% arrange(desc(lifeExp)) ``` --- # `mutate()` existing columns `mutate()` is a function that defines and inserts *new variables* into a data frame as *columns*. .center[ <img src="dplyr_mutate.png" width="529" /> ] --- # `mutate()` In the `weather` data frame from the `nycflights13` package, the variable `temp` is temperature in **degrees Fahrenheit**. To convert this to **degrees Celcius**, we use the formula: `$$\text{temp in C}=\frac{\text{temp in F}-32}{1.8}$$` -- Using `mutate`, we can create the variable `temp_C` using the existing `temp`: ```r weather = weather %>% mutate(temp_C = (temp - 32)/1.8) View(weather) ``` --- # `mutate()` ```r weather = weather %>% mutate(temp_C = (temp - 32)/1.8) ``` Note that by calling this new data frame `weather`, we *overwrite* the existing one that comes with the `nycflights13` package. - **Why?** We are not *losing* any information; only *adding* new information. - The original `temp` variable is still part of the data frame. --- # `summarize()` with `group_by()` The *verb-named functions* that we have looked at so far have all provided different ways to **manipulate** a data frame. - `select()`, `filter()`, `mutate()`, `rename()`, `arrange()` - The extent of the analysis we can perform *solely by manipulating a data frame* is limited. -- `summarize()` and `group_by()` allow us to make calculations and *comparisons*. - `summarize()` (with or without `group_by()`) collapses the data into rows based on **summary statistics**. .center[ <img src="dplyr_summarize.png" width="342" /> ] --- # `summarize()` The important step in `summarize()` is specifying **how** we want to reduce the data to a single set of summary statistics. - **Examples**: `mean()`, `sd()`, `median()`, `IQR()`, `max()`, `n()`, and many more -- Using the `weather` data frame, let's calculate the **mean** and **standard deviation** of `temp_F`: ```r weather %>% summarize(mean_temp_F = mean(temp, na.rm = TRUE), sd_temp_F = sd(temp, na.rm = TRUE)) ``` ``` ## # A tibble: 1 × 2 ## mean_temp_F sd_temp_F ## <dbl> <dbl> ## 1 55.3 17.8 ``` -- **Note**: We added `na.rm = TRUE` to remove `NA` (missing values) from each calculation. - Otherwise, the output for `mean()` and `sd()` would also be `NA`. --- # `summarize()` Given the nature of the `weather` data frame, why might the summary statistics in the previous slide contain *redundant information*? -- ```r weather %>% count(origin) ``` ``` ## # A tibble: 3 × 2 ## origin n ## <chr> <int> ## 1 EWR 8703 ## 2 JFK 8706 ## 3 LGA 8706 ``` Temperature was recorded each hour in 2013 at *three different NYC airports*. - Presumably, the temperature will be similar at each of the three airports at any given hour. -- **Also**, there are definitely *seasonal/monthly* trends in temperature, especially in a state like New York. --- # `group_by()` rows This is where `group_by()` comes in! The `summarize()` function is *almost always* used in conjunction with `group_by()`. .center[ <img src="dplyr_group_by.png" width="70%" /> ] We just used `summarize()` to calculate the **mean** and **standard deviation** of New York temperature in 2013. - But what if we want to calculate these summary statistics *for each month*? - We can use `group_by()` in the pipeline to generate *aggregate* summary statistics! --- # `group_by()` Before running the code below, *what do you think it does*? ```r weather_temp_month = weather %>% filter(origin == "JFK") %>% # Just focus on one NYC airport group_by(month) %>% summarize(mean_temp_F = mean(temp, na.rm = TRUE), sd_temp_F = sd(temp, na.rm = TRUE)) weather_temp_month ``` -- Because **temperature** usually varies by *season*, or *month*, it makes sense to calculate monthly summary statistics. - We could've also grouped by `origin` (i.e., `group_by(month, origin)`), but the monthly average temperature is probably very similar at the three airports. --- # `group_by()` The code on the previous slide is *identical* to the code we used when introducing `summarize()`... - Except we added the first step `group_by(month)`, which does exactly what it sounds like: it **groups by** *month* before calculating summary statistics. **Note**: We never use the `group_by()` operation by itself. We always combine it with a function that performs a calculation, such as `summarize()`. For example, run the following. You'll notice that nothing really changes from the initial data frame. ```r weather %>% group_by(month) ``` --- # Recap: The S.S. FAM .pull-left[ - `select()`: choose columns by name - `summarize()` (usually with `group_by()`): reduce variables to summary values - `filter()`: pick a subset of rows matching criteria - `arrange()`: reorder rows - `mutate()`: add new variables ] .pull-right[ <img src="boat.jpg" width="4000" /> ] There are a *handful* of other functions we'll cover, but you can do a lot with these five!