class: center, middle, inverse, title-slide

# CEMA 0907: Statistics in the Real World
## Sampling
### Anthony Scotina

---

# Needed Packages

```r
library(tidyverse)
library(moderndive)
```

---

# Virtual Sampling

We will use R to perform a **computer simulation**, in which we will sample from a *virtual environment*.

--

The `bowl` data frame in the `moderndive` R package contains data on a "population" (or a *virtual bowl*) of red and white balls.

- We will use R to collect *virtual samples*.

---

# A "Population"

```r
bowl
```

```
# A tibble: 2,400 × 2
   ball_ID color
     <int> <chr>
 1       1 white
 2       2 white
 3       3 white
 4       4 red  
 5       5 white
 6       6 white
 7       7 red  
 8       8 white
 9       9 red  
10      10 white
# … with 2,390 more rows
```

- This tells us that there are 2,400 total balls, with each *equally likely* to be selected in a virtual sample.

---

# One Virtual Sample

To collect a *virtual sample* of size `\(n=50\)`, we will use the `rep_sample_n()` function in the `moderndive` package.

- In `rep_sample_n()`, the `rep` stands for **repeat**, and the `n` refers to the **size** of the *virtual sample*.

```r
virtual_sample = rep_sample_n(bowl, size = 50)
```

```r
View(virtual_sample)
```

---

# One Virtual Sample

If you followed along, you might notice that your `ball_ID` variable contains different numbers from mine.

- This is because we each used R's **random number generator** when we ran `rep_sample_n()`, so our samples are all different!

--

Next, calculate the proportion of balls in your *virtual sample* that are red:

```r
virtual_sample %>% 
  summarize(prop = sum(color == "red")/50)
```

```
# A tibble: 1 × 2
  replicate  prop
      <int> <dbl>
1         1  0.34
```

---

# One Virtual Sample

In my *virtual sample*, 34% of the balls were red.

Now let's find the proportion of the 2,400 balls in the `bowl` data frame that are red:

```r
bowl %>% 
  summarize(prop = sum(color == "red")/2400)
```

```
# A tibble: 1 × 1
   prop
  <dbl>
1 0.375
```

- I was close! How close were you?

---

# Many Virtual Samples

While I had each of you take a *virtual sample*, there is an easy way for each of us to *simulate* many virtual samples, using `rep_sample_n()`.

Here's how you can simulate **30 samples**, each of size `\(n=50\)`, from the "population" of 2,400 balls:

```r
virtual_samples_30 = rep_sample_n(bowl, size = 50, reps = 30)
```

--

The syntax is almost identical to our single *virtual sample*.

- However, we add the `reps` argument, which indicates that we want to **repeat** the sampling 30 times.

---

# Many Virtual Samples

Notice that when we `View(virtual_samples_30)`, the first 50 rows of `replicate` are equal to `1`, the next 50 rows are equal to `2`, and so on, until you reach replicate `30`.

- Therefore, there are `\(30\cdot 50=1500\)` rows in this data frame.

--

We need to calculate the proportion of red balls **for each replicate**. How can we calculate the proportion of red balls in each group?

---

# Summarizing Proportion by Group

We can use the `dplyr` syntax!

```r
virtual_prop_red = virtual_samples_30 %>% 
  group_by(replicate) %>% 
  summarize(prop = sum(color == "red")/n())
```

```r
View(virtual_prop_red)
```

--

Using `View(virtual_prop_red)`, we see that there are now *30 rows*: each row gives a summary measure for the corresponding **replicate**.

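---

# Summarizing Proportion by Group

As an optional aside (not in the original code), an equivalent way to compute the same per-replicate proportions is to take the mean of a logical vector:

```r
# Equivalent to sum(color == "red")/n(): TRUE counts as 1 and FALSE as 0,
# so the mean of the logical vector is the proportion of red balls
virtual_samples_30 %>% 
  group_by(replicate) %>% 
  summarize(prop = mean(color == "red"))
```
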
---

# Practice

Let's construct a **histogram** of the 30 `prop` values in `virtual_prop_red`.

- What do you notice about the **distribution**?

```r
ggplot(data = virtual_prop_red, aes(x = prop)) + 
  geom_histogram(color = "white", binwidth = 0.05)
```

---

# 1,000 Virtual Samples

Across our 30 *virtual samples*, we can see that there is **variation** between samples.

- But we could get an even better idea of **sampling variability** if we use *more* than 30 replicates.

Instead of `reps = 30`, now let's try `reps = 1000` for 1,000 *virtual samples* of size 50!

```r
virtual_samples_1000 = rep_sample_n(bowl, size = 50, reps = 1000)
```

**Note**: You could try `View(virtual_samples_1000)`, but there are `\(1000\cdot 50=50000\)` rows!

---

# Practice

Perform the *exact same* calculations as with `virtual_samples_30`, but now with `virtual_samples_1000`:

1. Use `dplyr` to calculate the proportion of red balls in each **replicate**.

2. Construct a **histogram** of the 1,000 `prop` values (adapt the code from several slides back). What do you notice about the shape of the histogram?

---

# Solution

```r
virtual_prop_red_1000 = virtual_samples_1000 %>% 
  group_by(replicate) %>% 
  summarize(prop = sum(color == "red")/n())
```

```r
ggplot(data = virtual_prop_red_1000, aes(x = prop)) + 
  geom_histogram(color = "white", binwidth = 0.05)
```

---

# 1,000 Virtual Samples

.center[
![](07-Sampling_files/figure-html/unnamed-chunk-17-1.png)<!-- -->
]

---

# Many Large Virtual Samples

So far, we have been controlling *how many* **samples of size 50** we take from the "population."

- But we can also control the **sample size**.

Repeat the same exercise as before using 1,000 replicates, but this time use a **sample size of 100**.

--

```r
virtual_samples_100 = rep_sample_n(bowl, size = 100, reps = 1000)

virtual_prop_red_100 = virtual_samples_100 %>% 
  group_by(replicate) %>% 
  summarize(prop = sum(color == "red")/n())
```

---

# 1,000 Virtual Samples of size 100

.center[
<img src="07-Sampling_files/figure-html/unnamed-chunk-19-1.png" width="50%" />
]

These types of *distributions* are very special: they are called **sampling distributions**.

- A **sampling distribution** is a distribution of **sample statistics**.

---

# Sampling Distributions

A **sampling distribution** is a distribution of **sample statistics**.

The sample statistics in this case are **sample proportions**, `\(\hat{p}\)`.

- In each sample, `\(\hat{p}\)` is the proportion of balls that are red.

- We took *many* samples, so we were able to plot a histogram of the *many* values of `\(\hat{p}\)`.

---

# Mean, SD for n = 10

```r
virtual_samples_10 = rep_sample_n(bowl, size = 10, reps = 1000)

virtual_prop_red_10 = virtual_samples_10 %>% 
  group_by(replicate) %>% 
  summarize(prop = sum(color == "red")/n())

summary(virtual_prop_red_10$prop)
```

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.3000  0.4000  0.3706  0.5000  0.9000 
```

---

# Mean, SD for n = 50

```r
virtual_samples_50 = rep_sample_n(bowl, size = 50, reps = 1000)

virtual_prop_red_50 = virtual_samples_50 %>% 
  group_by(replicate) %>% 
  summarize(prop = sum(color == "red")/n())

summary(virtual_prop_red_50$prop)
```

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2000  0.3200  0.3800  0.3778  0.4200  0.6400 
```

---

# Mean, SD for n = 100

```r
virtual_samples_100 = rep_sample_n(bowl, size = 100, reps = 1000)

virtual_prop_red_100 = virtual_samples_100 %>% 
  group_by(replicate) %>% 
  summarize(prop = sum(color == "red")/n())

summary(virtual_prop_red_100$prop)
```

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.230   0.340   0.380   0.377   0.410   0.590 
```

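---

# Mean, SD for n = 100

The titles of the previous three slides mention the SD, but `summary()` only reports the five-number summary and the mean. As a minimal sketch (not in the original code), you can compute both the mean and the standard deviation of the simulated proportions directly; shown here for `\(n=100\)`, with the other sample sizes analogous:

```r
# Mean and SD of the 1,000 simulated sample proportions for n = 100.
# Your values will differ slightly from run to run.
virtual_prop_red_100 %>% 
  summarize(mean = mean(prop), sd = sd(prop))
```
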
---

# Summary Statistics

.center[
<table>
<thead>
<tr> <th style="text-align:right;"> sample.size </th> <th style="text-align:right;"> mean </th> <th style="text-align:right;"> sd </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 0.367 </td> <td style="text-align:right;"> 0.1500 </td> </tr>
<tr> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 0.377 </td> <td style="text-align:right;"> 0.0680 </td> </tr>
<tr> <td style="text-align:right;"> 100 </td> <td style="text-align:right;"> 0.375 </td> <td style="text-align:right;"> 0.0471 </td> </tr>
</tbody>
</table>
]

--

As the sample size **increases**, the standard deviation **decreases**.

- These types of standard deviations are so special that they get their own name: **standard error**.

- **Standard errors** quantify the effect of sampling variation on our estimates.

---

# Summary Statistics

.center[
<table>
<thead>
<tr> <th style="text-align:right;"> sample.size </th> <th style="text-align:right;"> mean </th> <th style="text-align:right;"> sd </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 0.367 </td> <td style="text-align:right;"> 0.1500 </td> </tr>
<tr> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 0.377 </td> <td style="text-align:right;"> 0.0680 </td> </tr>
<tr> <td style="text-align:right;"> 100 </td> <td style="text-align:right;"> 0.375 </td> <td style="text-align:right;"> 0.0471 </td> </tr>
</tbody>
</table>
]

**Remember**: The *true* "population" proportion is `\(p=0.375\)`.

A certain *theorem* states that, as the sample size gets *larger*...

1. The **mean** of the sampling distribution *converges* to `\(p\)`.

2. The **standard deviation** of the sampling distribution *converges* to `\(\sqrt{p(1-p)/n}\)`.

3. The **shape** of the sampling distribution is *Normal*.

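---

# Summary Statistics

As a quick check of the second point on the previous slide (a sketch, not part of the original slides), we can compare the theoretical standard errors `\(\sqrt{p(1-p)/n}\)` with `\(p = 0.375\)` to the simulated `sd` column in the table. The two should agree closely, up to simulation error:

```r
# Theoretical standard errors for p = 0.375 and n = 10, 50, 100
p = 0.375
n = c(10, 50, 100)
sqrt(p * (1 - p) / n)
# roughly 0.153, 0.068, 0.048 -- close to the simulated 0.1500, 0.0680, 0.0471
```
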
---

# Sampling Distribution for the Sample Mean

This exercise can also be done with **means**.

The `pennies` dataset in the `moderndive` package contains the ages of 800 pennies, measured in 2011.

- Let's treat this as a "population."

```r
View(pennies)
```

--

Using `rep_sample_n()`, take 1,000 replicates of *size 10* from this "population":

```r
virtual_pennies_10 = rep_sample_n(pennies, size = 10, reps = 1000)
```

--

Now, calculate the **mean** penny age within each replicate:

```r
virtual_mean_10 = virtual_pennies_10 %>% 
  group_by(replicate) %>% 
  summarize(mean_age = mean(age_in_2011))
```

---

# Practice

Repeat this exercise for `\(n=50\)` and `\(n=100\)`.

--

**Solution**

```r
virtual_pennies_50 = rep_sample_n(pennies, size = 50, reps = 1000)

virtual_mean_50 = virtual_pennies_50 %>% 
  group_by(replicate) %>% 
  summarize(mean_age = mean(age_in_2011))
```

```r
virtual_pennies_100 = rep_sample_n(pennies, size = 100, reps = 1000)

virtual_mean_100 = virtual_pennies_100 %>% 
  group_by(replicate) %>% 
  summarize(mean_age = mean(age_in_2011))
```

---

# Summary Statistics

.center[
<table>
<thead>
<tr> <th style="text-align:right;"> sample.size </th> <th style="text-align:right;"> mean </th> <th style="text-align:right;"> sd </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 21.09 </td> <td style="text-align:right;"> 3.98 </td> </tr>
<tr> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 21.20 </td> <td style="text-align:right;"> 1.70 </td> </tr>
<tr> <td style="text-align:right;"> 100 </td> <td style="text-align:right;"> 21.13 </td> <td style="text-align:right;"> 1.44 </td> </tr>
</tbody>
</table>
]

--

As the sample size **increases**, the standard error **decreases**.

--

```r
summary(pennies$age_in_2011)
```

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   11.00   20.00   21.15   30.00   63.00 
```

As the sample size **increases**, the *mean of the sampling distribution* gets closer to the *true mean*.

---

# Summary Statistics

.center[
<table>
<thead>
<tr> <th style="text-align:right;"> sample.size </th> <th style="text-align:right;"> mean </th> <th style="text-align:right;"> sd </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 21.09 </td> <td style="text-align:right;"> 3.98 </td> </tr>
<tr> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 21.20 </td> <td style="text-align:right;"> 1.70 </td> </tr>
<tr> <td style="text-align:right;"> 100 </td> <td style="text-align:right;"> 21.13 </td> <td style="text-align:right;"> 1.44 </td> </tr>
</tbody>
</table>
]

**Remember**: The *true* "population" mean is `\(\mu=21.15\)`, with a standard deviation of `\(\sigma=12.44\)`.

A certain *theorem* states that, as the sample size gets *larger*...

1. The **mean** of the sampling distribution *converges* to `\(\mu\)`.

2. The **standard deviation** of the sampling distribution *converges* to `\(\sigma/\sqrt{n}\)`.

3. The **shape** of the sampling distribution is *Normal*.

---

# Why do we sample?

In both the *virtual* simulations and in **real life**, our goal is the same:

- Estimate a *true* population quantity (such as a proportion or a mean) by taking samples from the *population*.

--

Additionally, we discussed two key concepts:

1. The effect of **sampling variation** on our estimates (i.e., `\(\hat{p}\)` or `\(\bar{x}\)`).

2. The effect of **sample size** on sampling variation.

---

# Central Limit Theorem

What we have illustrated throughout class today is one of the most important theorems in all of statistics.

**The Central Limit Theorem (CLT)**: As the sample size `\(n\)` gets larger, the *sampling distribution* of the sample mean (or sample proportion) becomes more *bell-shaped* (i.e., more **Normally** distributed) and narrower.

Specifically, we can write the following:

`$$\bar{x}\sim \text{Normal}\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$`

and

`$$\hat{p}\sim \text{Normal}\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$$`

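---

# Central Limit Theorem

As a closing sketch (not in the original slides), we can overlay the Normal density implied by the CLT on the simulated sampling distribution of `\(\hat{p}\)` for `\(n=100\)`. The binwidth and the density scale on the y-axis are choices made here for illustration:

```r
# Histogram of the 1,000 simulated proportions (n = 100) on the density scale,
# with the Normal(p, sqrt(p(1-p)/n)) curve from the CLT drawn on top
ggplot(data = virtual_prop_red_100, aes(x = prop)) + 
  geom_histogram(aes(y = after_stat(density)), color = "white", binwidth = 0.02) + 
  stat_function(fun = dnorm, 
                args = list(mean = 0.375, sd = sqrt(0.375 * 0.625 / 100)))
```

If the CLT is doing its job, the curve should track the histogram closely.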