class: center, middle, inverse, title-slide # CEMA 0907: Statistics in the Real World ## Data Visualization ### Anthony Scotina --- # (Good) Data Viz in R? **ggplot2** .center[ <img src="ggplot2_hex.png" width="241" /> ] --- # Needed Packages ```r library(tidyverse) library(nycflights13) ``` - If you don't have any of these packages, install them first! --- # Statistical Graphics .center[ <img src="ggplot2_plot1.png" width="960" /> ] --- # Statistical Graphics .center[ <img src="ggplot2_plot2.png" width="2048" /> ] --- # Statistical Graphics .center[ <img src="ggplot2_plot3.jpg" width="75%" /> ] --- # The Grammar of Graphics .pull-left[ <img src="grammar_paper.png" width="1419" /> ] .pull-right[ <img src="hadley.jpg" width="683" /> ] A theoretical framework for data visualization - *Idea*: You construct plots the same way that you construct sentences, by combining many different elements. --- # What is a statistical graphic? The **grammar of graphics** defines a "statistical graphic" as the following: - **statistical graphic**: a mapping of `data` variables to `aes()`thetic attributes of `geom_`etric objects -- Let's look back at the first statistical graphic on area vs. population in US cities. .center[ <table> <thead> <tr> <th style="text-align:left;"> data </th> <th style="text-align:left;"> aes </th> <th style="text-align:left;"> geom </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Area </td> <td style="text-align:left;"> x </td> <td style="text-align:left;"> point, smooth </td> </tr> <tr> <td style="text-align:left;"> Population </td> <td style="text-align:left;"> y </td> <td style="text-align:left;"> point, smooth </td> </tr> <tr> <td style="text-align:left;"> State </td> <td style="text-align:left;"> color </td> <td style="text-align:left;"> point </td> </tr> <tr> <td style="text-align:left;"> PopDensity </td> <td style="text-align:left;"> size </td> <td style="text-align:left;"> point </td> </tr> </tbody> </table> ] --- # Components of the Grammar We can break a graphic into three essential components: 1. `data`: the dataset composed of variables that we *map* 2. `geom`: the shape or visual representation of our data. - `geom_` ... point, line, boxplot, histogram, bar, etc. 3. `aes`: aesthetic attributes of the geometric object. - x/y position, color, shape, and size -- .center[ <img src="grammar-of-graphics.png" width="4405" /> ] --- # The `mtcars` Data Frame ```r head(mtcars, 10) # Show the first 10 rows of mtcars ``` ``` mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 ``` ```r ?mtcars ``` --- # Basic Template How can we make a plot of automobile `wt` (weight, in 1000 lbs) versus `mpg` (miles per gallon)? -- ```r ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) ``` - What happened? What do you think we're missing? --- # Basic Template ```r ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) + geom_point() ``` <img src="02-Data_Visualization_files/figure-html/unnamed-chunk-14-1.png" width="50%" /> --- class: middle, center # Five Named Graphs: The 5NG --- # 5NG#1: Scatterplots **Scatterplots**, also called **bivariate plots**, allow you to visualize the relationship between two *numerical* variables. Let's take another look at the `flights` dataset from the `nycflights13` package. - **Question**: What do you think is the relationship between flight **departure delay** and **arrival delay**? - If a flight is *delayed*, does it *arrive* at a later time than planned? Or does the flight speed up to accommodate? -- Back in 2019, I attended a conference in Colorado. I flew there using **Frontier Airlines**. So let's `filter` the flights dataset to look at only Frontier Airlines (carrier code: F9) flights: ```r frontier = flights %>% filter(carrier == "F9") ``` - We'll cover the specific syntax in this code soon. For now, just run it! --- # 5NG#1: Scatterplots ```r ggplot(data = frontier, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point() ``` <img src="02-Data_Visualization_files/figure-html/unnamed-chunk-16-1.png" width="50%" /> --- # 5NG#1: Scatterplots ```r ggplot(data = frontier, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point() ``` **Let's break this down...** Within the `ggplot()` function, we specify two of the components of the Grammar of Graphics as arguments (i.e. inputs): 1. The data frame to be `frontier` by setting `data = frontier`. 2. The `aes`thetic mapping by setting `aes(x = dep_delay, y = arr_delay)`. Specifically: - the variable `dep_delay` maps to the `x` position aesthetic - the variable `arr_delay` maps to the `y` position aesthetic - We add a layer to the `ggplot()` function call using the `+` sign. The layer in question specifies the third component of the grammar: the `geom`etric object. In this case the geometric object are points, set by specifying `geom_point()`. --- # A Note on Overplotting Go back to the original scatterplot of `dep_delay` versus `arr_delay`. There is a large clutter of points near 0, indicating no delays in departure or arrival of the flight. **The problem**: It is difficult to tell how many points are plotted when there are many clustered around the same values! -- **The solution**: Change the *transparancy* of the points by using: ```r ggplot(data = frontier, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point(alpha = 0.2) ``` - By default, the `alpha` option in `geom_point()` is set to `1`. - `alpha = 1` means 100% opaque. - `alpha = 0` means 100% transparent. --- # 5NG#2: Linegraphs **Linegraphs** are similar to scatterplots. They show the relationship between two *numerical* variables. - However, linegraphs are useful when the `x` variable is *sequential*, or *ordered*. - One of the most common ordered numerical variables is time. As a first example, let's take a look at the `weather` dataset in the `nycflights13` package. Specifically, let's look at temperature at the JFK airport in New York, between November 1 and November 18: ```r jfk_nov = weather %>% filter(origin == "JFK" & month == 11 & day <= 18) ``` (Again, don't worry about the syntax yet. Just copy and run!) --- # 5NG#2: Linegraphs In the **Grammar of Graphics**, the *only* difference between a scatterplot and a linegraph is the `geom`. Let’s plot a linegraph of hourly temperatures (`x = time_hour`, `y = temp`) in `jfk_nov` by using `geom_line()` instead of `geom_point()` like we did for scatterplots: --- # 5NG#2: Linegraphs ```r ggplot(data = jfk_nov, mapping = aes(x = time_hour, y = temp)) + geom_line() ``` <img src="02-Data_Visualization_files/figure-html/unnamed-chunk-20-1.png" width="55%" /> --- # 5NG#2: Linegraphs ```r ggplot(data = jfk_nov, mapping = aes(x = time_hour, y = temp)) + geom_line() ``` **Let's break this down...** There isn't as much to break down this time! The only major difference between this code and the code for scatterplots is the `geom_line()` object. - You could easily create a scatterplot between these two variables by using `geom_point()`, It just wouldn't look very nice. (Try it!) --- # 5NG#3: Histograms **Histograms** provide a visualization of the *distribution* of a single *numerical* variable. - You need only specific an `x` variable in a histogram. - By default, the `y` variable is *count*. -- Suppose we are interested in the *distribution* of hourly temperature recordings in New York. **Histograms** share the following information: - What is the smallest and largest temperatures, and how often are they observed? - What is the "center" temperature? - How are the temperatures spread out? - What are frequent and infrequent values? - Is there any skewness? --- # 5NG#3: Histograms ```r ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram() ``` <img src="02-Data_Visualization_files/figure-html/unnamed-chunk-22-1.png" width="50%" /> --- # 5NG#3: Histograms Before we even discuss the histogram, **always use** `color = "white"` **in** `geom_histogram()`!!! ```r ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(color = "white") ``` <img src="02-Data_Visualization_files/figure-html/unnamed-chunk-23-1.png" width="45%" /> --- # 5NG#3: Histograms .pull-left[ ![](02-Data_Visualization_files/figure-html/unnamed-chunk-24-1.png)<!-- --> ] .pull-right[ What do we notice about the histogram? - The **middle** temperatures are around 55-60 degrees Fahrenheit. - The **range** is from ~10 degrees to ~100 degrees. - There are **two prominent peaks** at ~30 degrees and ~70 degrees. - What do you think could account for these two peaks? ] --- # Changing the color of histograms The `color` argument changes the *outline* of each bar in the histogram. If you want to change the color of each *bar*, use the `fill` argument. - Try running this! ```r ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(color = "white", fill = "red") ``` - R has **many colors**. See [this](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf). --- # Facets **Faceting** is used when we’d like to split a particular visualization of variables by another variable. For example, we agreed that the two prominent peaks in the histogram of temperature might be due to seasons (colder in winter, warmer in summer). - Therefore, let's construct multiple histograms of `temp`, one for each `month`: ```r ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(color = "white") + facet_wrap( ~ month) ``` --- # Facets ```r ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(color = "white") + facet_wrap( ~ month) ``` <img src="02-Data_Visualization_files/figure-html/unnamed-chunk-27-1.png" width="50%" /> --- # Facets **Facets** form *another layer* to our grammar of graphics. - After we add the `geom`, we have the option of adding a `facet` if we want separate figures for levels of a variable. The syntax is `facet_wrap( ~ VARIABLE NAME)` --- # 5NG#4: Boxplots **Boxplots**, like **histograms**, show the *distribution* of a *numerical* variable. - However, boxplots are constructed using information provided by a **five-number summary**. **Five-number summary**: Minimum, 1st quartile (25th percentile), Median, 3rd quartile (75th percentile), Maximum In R, a five-number summary of any numerical variable can be found using the `summary()` function: ```r summary(jfk_nov$temp) ``` ``` Min. 1st Qu. Median Mean 3rd Qu. Max. 28.94 44.06 48.92 49.20 55.94 66.92 ``` (Recall that we can use the `$` operator to view individual variables in a data frame!) --- # 5NG#4: Boxplots <img src="02-Data_Visualization_files/figure-html/unnamed-chunk-29-1.png" width="65%" /> --- # 5NG#4: Boxplots <img src="02-Data_Visualization_files/figure-html/unnamed-chunk-30-1.png" width="65%" /> --- # 5NG#4: Boxplots **What does the boxplot tell us?** Between November 1 and November 18 at JFK Airport in New York... - 25% of points fall below the bottom edge of the box, which is the **first quartile** of 44.06 degrees. - 75% of points fall above the top edge of the box, which is the **third quartile** of 55.94 degrees. - 50% of points fall between the first and third quartiles, or between 44.06 and 55.94 degrees. - This is the **interquartile range (IQR)**. --- # 5NG#4: Boxplots **How can we make a boxplot?** -- Same as before, we just need to change the `geom_` object. ```r ggplot(data = jfk_nov, mapping = aes(y = temp)) + geom_boxplot() ``` (Notice we also use the `y` variable here, not the `x`!) --- # Side-by-side Boxplots Boxplots are more interesting when you compare several side-by-side. Let's use the `weather` dataset to compare `temp` by `month`, as we did before with histograms. ```r ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) + geom_boxplot() ``` --- # Side-by-side Boxplots <img src="02-Data_Visualization_files/figure-html/unnamed-chunk-33-1.png" width="50%" /> - In the `weather` dataset, R thinks `month` is **numerical** (since it appears in the data as numbers) when it is really **categorical**. We convert `month` to categorical using `factor(month)`. --- # Side-by-side Boxplots <img src="02-Data_Visualization_files/figure-html/unnamed-chunk-34-1.png" width="50%" /> - The dots representing values falling outside the whiskers are called outliers. These can be thought of as anomalous values. --- # 5NG#5: Barplots **Barplots** provide a visualization of the distribution of a *categorical variable*. - The x-axis shows *levels* of the categorical variable. - The y-axis shows the *count* of observations within each level. -- **One complication**: Are your data *counted* or *pre-counted*? --- # Counted vs. Pre-counted Categories Consider two data frames of the same categorical variable: - `fav.plot`: Which do you think is better: barplots or pie charts? --- # Counted vs. Pre-counted Categories **Counted** .center[ <table> <thead> <tr> <th style="text-align:left;"> fav.plot </th> <th style="text-align:right;"> count </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Pie Chart </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Barplot </td> <td style="text-align:right;"> 5 </td> </tr> </tbody> </table> ] -- **Pre-counted** .center[ <table> <thead> <tr> <th style="text-align:left;"> fav.plot </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Pie Chart </td> </tr> <tr> <td style="text-align:left;"> Barplot </td> </tr> <tr> <td style="text-align:left;"> Barplot </td> </tr> <tr> <td style="text-align:left;"> Barplot </td> </tr> <tr> <td style="text-align:left;"> Barplot </td> </tr> <tr> <td style="text-align:left;"> Pie Chart </td> </tr> <tr> <td style="text-align:left;"> Barplot </td> </tr> </tbody> </table> ] --- # Barplots for Pre-counted Data - Use `geom_bar()`: ```r ggplot(data = tab.pre, mapping = aes(x = fav.plot)) + geom_bar() ``` <img src="02-Data_Visualization_files/figure-html/unnamed-chunk-37-1.png" width="50%" /> --- # Barplots for Counted Data - Use `geom_col()`: ```r ggplot(data = tab.count, mapping = aes(x = fav.plot, y = count)) + geom_col() ``` <img src="02-Data_Visualization_files/figure-html/unnamed-chunk-38-1.png" width="50%" /> --- # Barplot of `carrier` Using the `flights` data frame, create a boxplot of the `carrier` variable. - **Hint**: The `carrier` variable is **pre-counted**, so use `geom_bar()`. -- **Solution** ```r ggplot(data = flights, mapping = aes(x = carrier)) + geom_bar() ``` --- # Multiple Categorical Variables Suppose we want to make different-colored bars for different airports (`origin`). ```r ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) + geom_bar() ``` <img src="02-Data_Visualization_files/figure-html/unnamed-chunk-40-1.png" width="45%" />