class: center, middle, inverse, title-slide

# CEMA 0907: Statistics in the Real World
## Sampling
### Anthony Scotina

---

# Needed Packages

```r
library(tidyverse)
library(moderndive)
```

---

# Virtual Sampling

We will use R to perform a **computer simulation**, in which we will sample from a *virtual environment*.

--

The `bowl` data frame in the `moderndive` R package contains data on a "population" (or a *virtual bowl*) of red and white balls.

- We will use R to collect *virtual samples*.

---

# A "Population"

```r
bowl
```

```
# A tibble: 2,400 × 2
   ball_ID color
     <int> <chr>
 1       1 white
 2       2 white
 3       3 white
 4       4 red  
 5       5 white
 6       6 white
 7       7 red  
 8       8 white
 9       9 red  
10      10 white
# … with 2,390 more rows
```

- This tells us that there are 2,400 total balls, with each *equally likely* to be selected in a virtual sample.

---

# One Virtual Sample

To collect a *virtual sample* of size `\(n=50\)`, we will use the `rep_sample_n()` function in the `moderndive` package.

- In `rep_sample_n()`, the `rep` stands for **repeat**, and the `n` refers to the **size** of the *virtual sample*.

```r
virtual_sample = rep_sample_n(bowl, size = 50)
```

```r
View(virtual_sample)
```

---

# One Virtual Sample

If you followed along, you might notice that your `ball_ID` variable contains different numbers from mine.

- This is because we each used R's **random number generator** when we ran `rep_sample_n()`, so our samples are all different!

--

Next, calculate the proportion of balls in your *virtual sample* that are red:

```r
virtual_sample %>% 
  summarize(prop = sum(color == "red")/50)
```

```
# A tibble: 1 × 2
  replicate  prop
      <int> <dbl>
1         1  0.34
```

---

# One Virtual Sample

In my *virtual sample*, 34% of the balls were red.

Now let's find the proportion of the 2,400 balls in the `bowl` data frame that are red:

```r
bowl %>% 
  summarize(prop = sum(color == "red")/2400)
```

```
# A tibble: 1 × 1
   prop
  <dbl>
1 0.375
```

- I was close! How close were you?

---

# Many Virtual Samples

While I had each of you take a *virtual sample*, there is an easy way for each of us to *simulate* many virtual samples, using `rep_sample_n()`.

Here's how you can simulate **30 samples**, each of size `\(n=50\)`, from the "population" of 2,400 balls:

```r
virtual_samples_30 = rep_sample_n(bowl, size = 50, reps = 30)
```

--

The syntax is almost identical to our single *virtual sample*.

- However, we add the `reps` argument, which indicates that we want to **repeat** the sampling 30 times.

---

# Many Virtual Samples

Notice that when we `View(virtual_samples_30)`, the first 50 rows of `replicate` are equal to `1`, the next 50 rows are equal to `2`, and so on, until you reach replicate `30`.

- Therefore, there are `\(30\cdot 50=1500\)` rows in this data frame.

--

We need to calculate the proportion of red balls **for each replicate**. How can we calculate the proportion of red balls in each group?

---

# Summarizing Proportion by Group

We can use the `dplyr` syntax!

```r
virtual_prop_red = virtual_samples_30 %>% 
  group_by(replicate) %>% 
  summarize(prop = sum(color == "red")/n())
```

```r
View(virtual_prop_red)
```

--

Using `View(virtual_prop_red)`, we see that there are now *30 rows*: each row gives a summary measure for the corresponding **replicate**.

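---

# Summarizing Proportion by Group

As an optional aside (not in the original code), an equivalent way to compute the same per-replicate proportions is to take the mean of a logical vector:

```r
# Equivalent to sum(color == "red")/n(): TRUE counts as 1 and FALSE as 0,
# so the mean of the logical vector is the proportion of red balls
virtual_samples_30 %>% 
  group_by(replicate) %>% 
  summarize(prop = mean(color == "red"))
```
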
---

# Practice

Let's construct a **histogram** of the 30 `prop` values in `virtual_prop_red`.

- What do you notice about the **distribution**?

```r
ggplot(data = virtual_prop_red, aes(x = prop)) + 
  geom_histogram(color = "white", binwidth = 0.05)
```

---

# 1,000 Virtual Samples

Across our 30 *virtual samples*, we can see that there is **variation** between samples.

- But we could get an even better idea of **sampling variability** if we use *more* than 30 replicates.

Instead of `reps = 30`, now let's try `reps = 1000` for 1,000 *virtual samples* of size 50!

```r
virtual_samples_1000 = rep_sample_n(bowl, size = 50, reps = 1000)
```

**Note**: You could try `View(virtual_samples_1000)`, but there are `\(1000\cdot 50=50000\)` rows!

---

# Practice

Perform the *exact same* calculations as with `virtual_samples_30`, but now with `virtual_samples_1000`:

1. Use `dplyr` to calculate the proportion of red balls in each **replicate**.

2. Construct a **histogram** of the 1,000 `prop` values (adapt the code from several slides back). What do you notice about the shape of the histogram?

---

# Solution

```r
virtual_prop_red_1000 = virtual_samples_1000 %>% 
  group_by(replicate) %>% 
  summarize(prop = sum(color == "red")/n())
```

```r
ggplot(data = virtual_prop_red_1000, aes(x = prop)) + 
  geom_histogram(color = "white", binwidth = 0.05)
```

---

# 1,000 Virtual Samples

.center[
![](07-Sampling_files/figure-html/unnamed-chunk-17-1.png)<!-- -->
]

---

# Many Large Virtual Samples

So far, we have been controlling *how many* **samples of size 50** we take from the "population."

- But we can also control the **sample size**.

Repeat the same exercise as before using 1,000 replicates, but this time use a **sample size of 100**.

--

```r
virtual_samples_100 = rep_sample_n(bowl, size = 100, reps = 1000)

virtual_prop_red_100 = virtual_samples_100 %>% 
  group_by(replicate) %>% 
  summarize(prop = sum(color == "red")/n())
```

---

# 1,000 Virtual Samples of size 100

.center[
<img src="07-Sampling_files/figure-html/unnamed-chunk-19-1.png" width="50%" />
]

These types of *distributions* are very special: they are called **sampling distributions**.

- A **sampling distribution** is a distribution of **sample statistics**.

---

# Sampling Distributions

A **sampling distribution** is a distribution of **sample statistics**.

The sample statistics in this case are **sample proportions**, `\(\hat{p}\)`.

- In each sample, `\(\hat{p}\)` is the proportion of balls that are red.

- We took *many* samples, so we were able to plot a histogram of the *many* values of `\(\hat{p}\)`.

---

# Mean, SD for n = 10

```r
virtual_samples_10 = rep_sample_n(bowl, size = 10, reps = 1000)

virtual_prop_red_10 = virtual_samples_10 %>% 
  group_by(replicate) %>% 
  summarize(prop = sum(color == "red")/n())

summary(virtual_prop_red_10$prop)
```

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.3000  0.4000  0.3706  0.5000  0.9000 
```

---

# Mean, SD for n = 50

```r
virtual_samples_50 = rep_sample_n(bowl, size = 50, reps = 1000)

virtual_prop_red_50 = virtual_samples_50 %>% 
  group_by(replicate) %>% 
  summarize(prop = sum(color == "red")/n())

summary(virtual_prop_red_50$prop)
```

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2000  0.3200  0.3800  0.3778  0.4200  0.6400 
```

---

# Mean, SD for n = 100

```r
virtual_samples_100 = rep_sample_n(bowl, size = 100, reps = 1000)

virtual_prop_red_100 = virtual_samples_100 %>% 
  group_by(replicate) %>% 
  summarize(prop = sum(color == "red")/n())

summary(virtual_prop_red_100$prop)
```

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.230   0.340   0.380   0.377   0.410   0.590 
```

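---

# Mean, SD for n = 100

The titles of the previous three slides mention the SD, but `summary()` only reports the five-number summary and the mean. As a minimal sketch (not in the original code), you can compute both the mean and the standard deviation of the simulated proportions directly; shown here for `\(n=100\)`, with the other sample sizes analogous:

```r
# Mean and SD of the 1,000 simulated sample proportions for n = 100.
# Your values will differ slightly from run to run.
virtual_prop_red_100 %>% 
  summarize(mean = mean(prop), sd = sd(prop))
```
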
---

# Summary Statistics

.center[
<table>
<thead>
<tr> <th style="text-align:right;"> sample.size </th> <th style="text-align:right;"> mean </th> <th style="text-align:right;"> sd </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 0.367 </td> <td style="text-align:right;"> 0.1500 </td> </tr>
<tr> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 0.377 </td> <td style="text-align:right;"> 0.0680 </td> </tr>
<tr> <td style="text-align:right;"> 100 </td> <td style="text-align:right;"> 0.375 </td> <td style="text-align:right;"> 0.0471 </td> </tr>
</tbody>
</table>
]

--

As the sample size **increases**, the standard deviation **decreases**.

- These types of standard deviations are so special that they get their own name: **standard error**.

- **Standard errors** quantify the effect of sampling variation on our estimates.

---

# Summary Statistics

.center[
<table>
<thead>
<tr> <th style="text-align:right;"> sample.size </th> <th style="text-align:right;"> mean </th> <th style="text-align:right;"> sd </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 0.367 </td> <td style="text-align:right;"> 0.1500 </td> </tr>
<tr> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 0.377 </td> <td style="text-align:right;"> 0.0680 </td> </tr>
<tr> <td style="text-align:right;"> 100 </td> <td style="text-align:right;"> 0.375 </td> <td style="text-align:right;"> 0.0471 </td> </tr>
</tbody>
</table>
]

**Remember**: The *true* "population" proportion is `\(p=0.375\)`.

A certain *theorem* states that, as the sample size gets *larger*...

1. The **mean** of the sampling distribution *converges* to `\(p\)`.

2. The **standard deviation** of the sampling distribution *converges* to `\(\sqrt{p(1-p)/n}\)`.

3. The **shape** of the sampling distribution is *Normal*.

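---

# Summary Statistics

As a quick check of the second point on the previous slide (a sketch, not part of the original slides), we can compare the theoretical standard errors `\(\sqrt{p(1-p)/n}\)` with `\(p = 0.375\)` to the simulated `sd` column in the table. The two should agree closely, up to simulation error:

```r
# Theoretical standard errors for p = 0.375 and n = 10, 50, 100
p = 0.375
n = c(10, 50, 100)
sqrt(p * (1 - p) / n)
# roughly 0.153, 0.068, 0.048 -- close to the simulated 0.1500, 0.0680, 0.0471
```
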
---

# Sampling Distribution for the Sample Mean

This exercise can also be done with **means**.

The `pennies` dataset in the `moderndive` package contains the ages of 800 pennies, measured in 2011.

- Let's treat this as a "population."

```r
View(pennies)
```

--

Using `rep_sample_n()`, take 1,000 replicates of *size 10* from this "population":

```r
virtual_pennies_10 = rep_sample_n(pennies, size = 10, reps = 1000)
```

--

Now, calculate the **mean** penny age within each replicate:

```r
virtual_mean_10 = virtual_pennies_10 %>% 
  group_by(replicate) %>% 
  summarize(mean_age = mean(age_in_2011))
```

---

# Practice

Repeat this exercise for `\(n=50\)` and `\(n=100\)`.

--

**Solution**

```r
virtual_pennies_50 = rep_sample_n(pennies, size = 50, reps = 1000)

virtual_mean_50 = virtual_pennies_50 %>% 
  group_by(replicate) %>% 
  summarize(mean_age = mean(age_in_2011))
```

```r
virtual_pennies_100 = rep_sample_n(pennies, size = 100, reps = 1000)

virtual_mean_100 = virtual_pennies_100 %>% 
  group_by(replicate) %>% 
  summarize(mean_age = mean(age_in_2011))
```

---

# Summary Statistics

.center[
<table>
<thead>
<tr> <th style="text-align:right;"> sample.size </th> <th style="text-align:right;"> mean </th> <th style="text-align:right;"> sd </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 21.09 </td> <td style="text-align:right;"> 3.98 </td> </tr>
<tr> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 21.20 </td> <td style="text-align:right;"> 1.70 </td> </tr>
<tr> <td style="text-align:right;"> 100 </td> <td style="text-align:right;"> 21.13 </td> <td style="text-align:right;"> 1.44 </td> </tr>
</tbody>
</table>
]

--

As the sample size **increases**, the standard error **decreases**.

--

```r
summary(pennies$age_in_2011)
```

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   11.00   20.00   21.15   30.00   63.00 
```

As the sample size **increases**, the *mean of the sampling distribution* gets closer to the *true mean*.

---

# Summary Statistics

.center[
<table>
<thead>
<tr> <th style="text-align:right;"> sample.size </th> <th style="text-align:right;"> mean </th> <th style="text-align:right;"> sd </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 21.09 </td> <td style="text-align:right;"> 3.98 </td> </tr>
<tr> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 21.20 </td> <td style="text-align:right;"> 1.70 </td> </tr>
<tr> <td style="text-align:right;"> 100 </td> <td style="text-align:right;"> 21.13 </td> <td style="text-align:right;"> 1.44 </td> </tr>
</tbody>
</table>
]

**Remember**: The *true* "population" mean is `\(\mu=21.15\)`, with a standard deviation of `\(\sigma=12.44\)`.

A certain *theorem* states that, as the sample size gets *larger*...

1. The **mean** of the sampling distribution *converges* to `\(\mu\)`.

2. The **standard deviation** of the sampling distribution *converges* to `\(\sigma/\sqrt{n}\)`.

3. The **shape** of the sampling distribution is *Normal*.

---

# Why do we sample?

In both the *virtual* simulations and in **real life**, our goal is the same:

- Estimate a *true* population quantity (such as a proportion or a mean) by taking samples from the *population*.

--

Additionally, we discussed two key concepts:

1. The effect of **sampling variation** on our estimates (i.e., `\(\hat{p}\)` or `\(\bar{x}\)`).

2. The effect of **sample size** on sampling variation.

---

# Central Limit Theorem

What we have illustrated throughout class today is one of the most important theorems in all of statistics.

**The Central Limit Theorem (CLT)**: As the sample size `\(n\)` gets larger, the *sampling distribution* of the sample mean (or sample proportion) becomes more *bell-shaped* (i.e., more **Normally** distributed) and narrower.

Specifically, we can write the following:

`$$\bar{x}\sim \text{Normal}\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$`

and

`$$\hat{p}\sim \text{Normal}\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$$`

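---

# Central Limit Theorem

As a closing sketch (not in the original slides), we can overlay the Normal density implied by the CLT on the simulated sampling distribution of `\(\hat{p}\)` for `\(n=100\)`. The binwidth and the density scale on the y-axis are choices made here for illustration:

```r
# Histogram of the 1,000 simulated proportions (n = 100) on the density scale,
# with the Normal(p, sqrt(p(1-p)/n)) curve from the CLT drawn on top
ggplot(data = virtual_prop_red_100, aes(x = prop)) + 
  geom_histogram(aes(y = after_stat(density)), color = "white", binwidth = 0.02) + 
  stat_function(fun = dnorm, 
                args = list(mean = 0.375, sd = sqrt(0.375 * 0.625 / 100)))
```

If the CLT is doing its job, the curve should track the histogram closely.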