CEMA 0907: Statistics in the Real World

class: center, middle, inverse, title-slide

# CEMA 0907: Statistics in the Real World
## Data Visualization
### Anthony Scotina

---

# (Good) Data Viz in R?

**ggplot2**

.center[
<img src="ggplot2_hex.png" width="241" />
]

---

# Needed Packages

```r
library(tidyverse)
library(nycflights13)
```

- If you don't have any of these packages, install them first!

---

# Statistical Graphics

.center[
<img src="ggplot2_plot1.png" width="960" />
]

---

# Statistical Graphics

.center[
<img src="ggplot2_plot2.png" width="2048" />
]

---

# Statistical Graphics

.center[
<img src="ggplot2_plot3.jpg" width="75%" />
]

---

# The Grammar of Graphics

.pull-left[
<img src="grammar_paper.png" width="1419" />
]

.pull-right[
<img src="hadley.jpg" width="683" />
]

A theoretical framework for data visualization
- *Idea*: You construct plots the same way that you construct sentences, by combining many different elements.

---

# What is a statistical graphic?

The **grammar of graphics** defines a "statistical graphic" as the following:
- **statistical graphic**: a mapping of `data` variables to `aes()`thetic attributes of `geom_`etric objects

Let's look back at the first statistical graphic on area vs. population in US cities.

.center[
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> data </th>
   <th style="text-align:left;"> aes </th>
   <th style="text-align:left;"> geom </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Area </td>
   <td style="text-align:left;"> x </td>
   <td style="text-align:left;"> point, smooth </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Population </td>
   <td style="text-align:left;"> y </td>
   <td style="text-align:left;"> point, smooth </td>
  </tr>
  <tr>
   <td style="text-align:left;"> State </td>
   <td style="text-align:left;"> color </td>
   <td style="text-align:left;"> point </td>
  </tr>
  <tr>
   <td style="text-align:left;"> PopDensity </td>
   <td style="text-align:left;"> size </td>
   <td style="text-align:left;"> point </td>
  </tr>
</tbody>
</table>
]

---

# Components of the Grammar

We can break a graphic into three essential components:

1. `data`: the dataset composed of variables that we *map*
2. `geom`: the shape or visual representation of our data. 
    - `geom_` ... point, line, boxplot, histogram, bar, etc.
3. `aes`: aesthetic attributes of the geometric object. 
    - x/y position, color, shape, and size
    
--

.center[
<img src="grammar-of-graphics.png" width="4405" />
]

---

# The `mtcars` Data Frame

```r
head(mtcars, 10) # Show the first 10 rows of mtcars
```

```
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
```

```r
?mtcars
```

---

# Basic Template

How can we make a plot of automobile `wt` (weight, in 1000 lbs) versus `mpg` (miles per gallon)?

```r
ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))
```

- What happened? What do you think we're missing?

---

# Basic Template

```r
ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) + 
  geom_point()
```

---

class: middle, center

# Five Named Graphs: The 5NG

---

# 5NG#1: Scatterplots

**Scatterplots**, also called **bivariate plots**, allow you to visualize the relationship between two *numerical* variables.

Let's take another look at the `flights` dataset from the `nycflights13` package.

- **Question**: What do you think is the relationship between flight **departure delay** and **arrival delay**?
    - If a flight is *delayed*, does it *arrive* at a later time than planned? Or does the flight speed up to accommodate?
    
--

Back in 2019, I attended a conference in Colorado. I flew there using **Frontier Airlines**. So let's `filter` the flights dataset to look at only Frontier Airlines (carrier code: F9) flights:

```r
frontier = flights %>%
  filter(carrier == "F9")
```

- We'll cover the specific syntax in this code soon. For now, just run it!

---

# 5NG#1: Scatterplots

```r
ggplot(data = frontier, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_point()
```

**Let's break this down...**

Within the `ggplot()` function, we specify two of the components of the Grammar of Graphics as arguments (i.e. inputs):

1. The data frame, by setting `data = frontier`.

2. The `aes`thetic mapping by setting `aes(x = dep_delay, y = arr_delay)`. Specifically:
    - the variable `dep_delay` maps to the `x` position aesthetic
    - the variable `arr_delay` maps to the `y` position aesthetic
    
- We add a layer to the `ggplot()` call using the `+` sign. This layer specifies the third component of the grammar: the `geom`etric object. Here the geometric objects are points, specified with `geom_point()`.
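
One way to see these three components is to inspect the plot object itself. The sketch below uses the built-in `mtcars` data from earlier rather than `frontier`, so it runs without `nycflights13`. The object names `base_plot` and `scatter` are just illustrative.

```r
library(ggplot2)

# Data and aesthetic mapping only: a blank canvas with axes, no geom yet
base_plot <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# Adding a layer with `+` supplies the third component, the geom
scatter <- base_plot + geom_point()

length(base_plot$layers)            # number of layers before adding a geom
length(scatter$layers)              # number of layers after adding geom_point()
class(scatter$layers[[1]]$geom)[1]  # the geom used by the first layer
```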

---

# A Note on Overplotting

Go back to the original scatterplot of `dep_delay` versus `arr_delay`. There is a large cluster of points near (0, 0), indicating flights with little or no delay in departure or arrival.

**The problem**: It is difficult to tell how many points are plotted when there are many clustered around the same values!

**The solution**: Change the *transparency* of the points by using the `alpha` argument:

```r
ggplot(data = frontier, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_point(alpha = 0.2)
```

- By default, the `alpha` option in `geom_point()` is set to `1`.
    - `alpha = 1` means 100% opaque. 
    - `alpha = 0` means 100% transparent. 
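
To get a feel for why `alpha = 0.2` helps, here is a rough calculation of how dark a spot looks when several semi-transparent points land on top of each other. It assumes standard alpha compositing, where each new point covers a fraction `alpha` of whatever is still showing through. `stacked_opacity()` is a hypothetical helper written for this slide, not a ggplot2 function.

```r
# Combined opacity of n identical points drawn on the same spot,
# assuming each point covers a fraction `alpha` of what is still visible
stacked_opacity <- function(alpha, n) {
  1 - (1 - alpha)^n
}

stacked_opacity(alpha = 0.2, n = 1)   # a single point
stacked_opacity(alpha = 0.2, n = 5)   # five overlapping points
stacked_opacity(alpha = 0.2, n = 20)  # a dense cluster
stacked_opacity(alpha = 1, n = 1)     # the default: already fully opaque
```

Under this assumption, darker regions of the plot correspond to more flights with nearly the same pair of delays.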
    
---

# 5NG#2: Linegraphs

**Linegraphs** are similar to scatterplots. They show the relationship between two *numerical* variables.

- However, linegraphs are useful when the `x` variable is *sequential*, or *ordered*. 
- One of the most common ordered numerical variables is time.

As a first example, let's take a look at the `weather` dataset in the `nycflights13` package. Specifically, let's look at hourly temperatures at JFK Airport in New York between November 1 and November 18:

```r
jfk_nov = weather %>%
  filter(origin == "JFK" & month == 11 & day <= 18)
```

(Again, don't worry about the syntax yet. Just copy and run!)

---

# 5NG#2: Linegraphs

In the **Grammar of Graphics**, the *only* difference between a scatterplot and a linegraph is the `geom`.

Let’s plot a linegraph of hourly temperatures (`x = time_hour`, `y = temp`) in `jfk_nov` by using `geom_line()` instead of `geom_point()` like we did for scatterplots:

---

# 5NG#2: Linegraphs

```r
ggplot(data = jfk_nov, mapping = aes(x = time_hour, y = temp)) + 
  geom_line()
```

**Let's break this down...**

There isn't as much to break down this time! The only major difference between this code and the code for scatterplots is the `geom_line()` object. 
- You could easily create a scatterplot of these two variables by using `geom_point()`. It just wouldn't look very nice. (Try it!)
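
Because geoms are layers, a line and its points can also be drawn together. Here is a minimal sketch, using R's built-in `airquality` data (daily New York temperatures from May to September 1973) instead of `jfk_nov` so that it runs without `nycflights13`. The `date` column and the object names are constructed here for illustration only.

```r
library(ggplot2)

# Keep May only, and build an ordered time variable from Month and Day
may_temps <- airquality[airquality$Month == 5, ]
may_temps$date <- as.Date(paste(1973, may_temps$Month, may_temps$Day, sep = "-"))

# Same data and aes; two geoms layered on top of each other
line_and_points <- ggplot(data = may_temps, mapping = aes(x = date, y = Temp)) +
  geom_line() +
  geom_point()

line_and_points
```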

---

# 5NG#3: Histograms

**Histograms** provide a visualization of the *distribution* of a single *numerical* variable. 
- You need only specify an `x` variable in a histogram. 
- By default, the `y` variable is *count*.

Suppose we are interested in the *distribution* of hourly temperature recordings in New York.

**Histograms** help answer the following questions:
- What are the smallest and largest temperatures, and how often are they observed?
- What is the "center" temperature?
- How are the temperatures spread out?
- What are frequent and infrequent values?
- Is there any skewness?

---

# 5NG#3: Histograms

```r
ggplot(data = weather, mapping = aes(x = temp)) + 
  geom_histogram()
```
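
To see what "the `y` variable is *count*" means in practice, we can pull out the numbers ggplot2 computes for each bar with `layer_data()`. This sketch uses the built-in `airquality` data (daily New York temperatures, May to September 1973) rather than `weather`, and sets `bins = 15` explicitly. Both choices are assumptions made so the example runs on its own.

```r
library(ggplot2)

temp_hist <- ggplot(data = airquality, mapping = aes(x = Temp)) +
  geom_histogram(bins = 15)

# One row per bar: the bin edges and the number of observations in the bin
bin_counts <- layer_data(temp_hist)[, c("xmin", "xmax", "count")]
bin_counts

# Every observation lands in exactly one bin
sum(bin_counts$count) == nrow(airquality)
```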

---

# 5NG#3: Histograms

Before we even discuss the histogram, **always use** `color = "white"` **in** `geom_histogram()`!!!

```r
ggplot(data = weather, mapping = aes(x = temp)) + 
  geom_histogram(color = "white")
```

---

# 5NG#3: Histograms

.pull-left[
![](02-Data_Visualization_files/figure-html/unnamed-chunk-24-1.png)
]

.pull-right[
What do we notice about the histogram?

- The **middle** temperatures are around 55-60 degrees Fahrenheit. 
- The **range** is from ~10 degrees to ~100 degrees. 
- There are **two prominent peaks** at ~30 degrees and ~70 degrees. 
    - What do you think could account for these two peaks?
]

---

# Changing the color of histograms

The `color` argument changes the *outline* of each bar in the histogram.

If you want to change the color of each *bar*, use the `fill` argument. 
- Try running this!

```r
ggplot(data = weather, mapping = aes(x = temp)) + 
  geom_histogram(color = "white", fill = "red")
```

- R has **many colors**. See [this](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf).
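
The built-in `colors()` function lists the color names that base R recognizes, so you can check a name before passing it to `fill` or `color`. A quick sketch (the name `"notacolor"` is deliberately made up):

```r
# All color names known to base R
head(colors())
length(colors())

# Check whether particular names are valid
c("red", "steelblue", "notacolor") %in% colors()
```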

---

# Facets

**Faceting** is used when we’d like to split a particular visualization of variables by another variable.

For example, we agreed that the two prominent peaks in the histogram of temperature might be due to seasons (colder in winter, warmer in summer). 
- Therefore, let's construct multiple histograms of `temp`, one for each `month`:

```r
ggplot(data = weather, mapping = aes(x = temp)) + 
  geom_histogram(color = "white") + 
  facet_wrap( ~ month)
```

---


# Facets

**Facets** form *another layer* to our grammar of graphics.

- After we add the `geom`, we have the option of adding a `facet` if we want separate figures for levels of a variable.

The syntax is `facet_wrap( ~ VARIABLE_NAME)`, where `VARIABLE_NAME` is the variable to split by.
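
`facet_wrap()` also accepts `nrow` and `ncol` arguments to control how the panels are arranged. A minimal sketch, using the built-in `airquality` data (its `Month` column runs from 5 to 9) in place of `weather` so that it runs without `nycflights13`:

```r
library(ggplot2)

# One histogram of Temp per month, arranged in a single row of panels
temp_by_month <- ggplot(data = airquality, mapping = aes(x = Temp)) +
  geom_histogram(bins = 15, color = "white") +
  facet_wrap( ~ Month, nrow = 1)

temp_by_month

# Number of panels equals the number of distinct months
length(unique(layer_data(temp_by_month)$PANEL))
```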

---

# 5NG#4: Boxplots

**Boxplots**, like **histograms**, show the *distribution* of a *numerical* variable. 
- However, boxplots are constructed using information provided by a **five-number summary**.

**Five-number summary**: Minimum, 1st quartile (25th percentile), Median, 3rd quartile (75th percentile), Maximum

In R, a five-number summary of any numerical variable (along with its mean) can be found using the `summary()` function:

```r
summary(jfk_nov$temp)
```

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  28.94   44.06   48.92   49.20   55.94   66.92 
```

(Recall that we can use the `$` operator to view individual variables in a data frame!)
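
If you want only the five numbers, without the mean, `quantile()` returns them directly. The sketch below uses the built-in `airquality$Temp` (daily New York temperatures, May to September 1973) as a stand-in for `jfk_nov$temp`, so the values will differ from the output above.

```r
temps <- airquality$Temp  # `$` extracts one variable as a vector

# Minimum, quartiles, and maximum (0%, 25%, 50%, 75%, 100%)
five_num <- quantile(temps, probs = c(0, 0.25, 0.5, 0.75, 1))
five_num

# summary() reports the same five numbers, plus the mean
summary(temps)
```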

---

# 5NG#4: Boxplots

**What does the boxplot tell us?**

Between November 1 and November 18 at JFK Airport in New York...
- 25% of points fall below the bottom edge of the box, which is the **first quartile** of 44.06 degrees. 
- 75% of points fall below the top edge of the box, which is the **third quartile** of 55.94 degrees. 
- 50% of points fall between the first and third quartiles, or between 44.06 and 55.94 degrees. 
    - The width of this interval, 55.94 - 44.06 = 11.88 degrees, is the **interquartile range (IQR)**.

---

# 5NG#4: Boxplots

**How can we make a boxplot?**

Same as before, we just need to change the `geom_` object.

```r
ggplot(data = jfk_nov, mapping = aes(y = temp)) + 
  geom_boxplot()
```

(Notice that we map `temp` to the `y` aesthetic here, not `x`!)

---

# Side-by-side Boxplots

Boxplots are more interesting when you compare several side-by-side.

Let's use the `weather` dataset to compare `temp` by `month`, as we did before with histograms.

```r
ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) + 
  geom_boxplot()
```

---

# Side-by-side Boxplots

- In the `weather` dataset, R thinks `month` is **numerical** (since it appears in the data as numbers) when it is really **categorical**. We convert `month` to categorical using `factor(month)`.
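
Here is a small sketch of what `factor()` does, using the built-in `airquality` data (whose `Month` column is also stored as a number) so that it runs without `nycflights13`. Without `factor()`, ggplot2 would treat `Month` as a continuous variable rather than as separate groups.

```r
library(ggplot2)

class(airquality$Month)           # stored as a number
levels(factor(airquality$Month))  # as a factor: one level per month

# With factor(): one box per month
by_month <- ggplot(data = airquality, mapping = aes(x = factor(Month), y = Temp)) +
  geom_boxplot()

nrow(layer_data(by_month))  # number of boxes drawn
```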

---

# Side-by-side Boxplots

- The dots representing values falling outside the whiskers are called outliers. These can be thought of as anomalous values.
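
By default, `geom_boxplot()` extends each whisker to the most extreme point within 1.5 times the IQR of the box, and anything beyond that is drawn as a dot. A rough sketch of that rule in base R, using the built-in `airquality$Wind` column as example data (the variable names are just illustrative):

```r
wind <- airquality$Wind

q1 <- quantile(wind, 0.25, names = FALSE)
q3 <- quantile(wind, 0.75, names = FALSE)
iqr <- q3 - q1  # same value as IQR(wind)

# Fences: points beyond these are plotted individually as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- wind[wind < lower_fence | wind > upper_fence]
outliers
```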

---

# 5NG#5: Barplots

**Barplots** provide a visualization of the distribution of a *categorical variable*.

- The x-axis shows *levels* of the categorical variable. 
- The y-axis shows the *count* of observations within each level.

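**One complication**: Are your data *counted* (one row per category, with its count in a separate column) or *pre-counted* (one row per observation, not yet tallied)?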

---

# Counted vs. Pre-counted Categories

Consider two data frames of the same categorical variable:
- `fav.plot`: Which do you think is better: barplots or pie charts?

---

# Counted vs. Pre-counted Categories

**Counted**

.center[
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> fav.plot </th>
   <th style="text-align:right;"> count </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Pie Chart </td>
   <td style="text-align:right;"> 2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Barplot </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
</tbody>
</table>
]

**Pre-counted**

.center[
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> fav.plot </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Pie Chart </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Barplot </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Barplot </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Barplot </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Barplot </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Pie Chart </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Barplot </td>
  </tr>
</tbody>
</table>
]
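
The plotting code for these tables refers to data frames named `tab.pre` and `tab.count`, whose construction is not shown in these slides. One plausible way to build them from the tables above in base R is sketched here. The exact column types are an assumption.

```r
# Pre-counted: one row per response, exactly as listed in the table above
tab.pre <- data.frame(
  fav.plot = c("Pie Chart", "Barplot", "Barplot", "Barplot",
               "Barplot", "Pie Chart", "Barplot")
)

# Counted: tally the responses into one row per category
tab.count <- as.data.frame(table(fav.plot = tab.pre$fav.plot),
                           responseName = "count")
tab.count
```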

---

# Barplots for Pre-counted Data

- Use `geom_bar()`:

```r
ggplot(data = tab.pre, mapping = aes(x = fav.plot)) + 
  geom_bar()
```

---

# Barplots for Counted Data

- Use `geom_col()`:

```r
ggplot(data = tab.count, mapping = aes(x = fav.plot, y = count)) + 
  geom_col()
```
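
As a quick check that the two approaches draw the same picture, we can compare the bar heights each plot computes. The data frames are rebuilt here in compact form from the tables shown earlier so that the sketch runs on its own. `bar_plot` and `col_plot` are illustrative names.

```r
library(ggplot2)

tab.pre <- data.frame(fav.plot = rep(c("Barplot", "Pie Chart"), times = c(5, 2)))
tab.count <- data.frame(fav.plot = c("Barplot", "Pie Chart"), count = c(5, 2))

bar_plot <- ggplot(data = tab.pre, mapping = aes(x = fav.plot)) +
  geom_bar()   # counts the rows for us
col_plot <- ggplot(data = tab.count, mapping = aes(x = fav.plot, y = count)) +
  geom_col()   # uses the count column as the bar height

# Bar heights computed by each plot
layer_data(bar_plot)$y
layer_data(col_plot)$y
```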

---

# Barplot of `carrier`

Using the `flights` data frame, create a barplot of the `carrier` variable. 
- **Hint**: The `carrier` variable is **pre-counted**, so use `geom_bar()`.

**Solution**

```r
ggplot(data = flights, mapping = aes(x = carrier)) + 
  geom_bar()
```

---

# Multiple Categorical Variables

Suppose we want to make different-colored bars for different airports (`origin`).

```r
ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) + 
  geom_bar()
```
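
By default, mapping a variable to `fill` produces *stacked* bars. If side-by-side bars are easier to compare, `geom_bar()` accepts `position = "dodge"`. A minimal sketch with the built-in `mtcars` data, using `cyl` and `gear` as stand-ins for `carrier` and `origin` so that it runs without `nycflights13`:

```r
library(ggplot2)

# Stacked bars (the default when `fill` is mapped)
stacked <- ggplot(data = mtcars, mapping = aes(x = factor(cyl), fill = factor(gear))) +
  geom_bar()

# Side-by-side bars within each cylinder group
dodged <- ggplot(data = mtcars, mapping = aes(x = factor(cyl), fill = factor(gear))) +
  geom_bar(position = "dodge")

stacked
dodged
```

Both versions count the same rows. Only the arrangement of the bars changes.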