class: center, middle, inverse, title-slide

# CEMA 0907: Statistics in the Real World
## Confidence Intervals
### Anthony Scotina

---
class: center, middle, frame

# Introduction

---
# Recap

If a sample of size `\(n\)` is drawn at **random**, then the resulting sample is *unbiased* and **representative** of the **population**.

- Thus, the **sample statistic** from the representative sample represents a "good guess" of the (unknown) **population parameter**.

Using the `bowl` data frame in the `moderndive` R package, we used the **sample proportion**, `\(\hat{p}\)`, to estimate the **population proportion**, `\(p\)`.

Generally, we will use the *sample* to **infer** about the *population*.

---
# In reality...

In *most cases*, we don't have the population values like we did with the `bowl` data frame, and we don't take many samples from the population.

- We only have a single sample of data from a larger population!

While the **sample statistic** represents our single *best guess* at the (unknown) **population parameter**, we would also like to create a *range of plausible values* for the population parameter.

- This range is called a **confidence interval**.

--

How do we use a single sample to get some idea of how other samples might vary in terms of their statistic values?

- **Bootstrapping**

---
# Needed Packages

```r
library(moderndive)
library(tidyverse)
library(infer)
```

---
class: center, middle, frame

# Bootstrapping

---
# `pennies_sample`

```r
pennies_sample
```

```
# A tibble: 50 × 2
      ID  year
   <int> <dbl>
 1     1  2002
 2     2  1986
 3     3  2017
 4     4  1988
 5     5  2008
 6     6  1983
 7     7  2008
 8     8  1996
 9     9  2004
10    10  2000
# … with 40 more rows
```

**Question**: What is the *average year* on US pennies in 2019?

- Can we use the sample of 50 pennies to help answer this question?
---
# Exploratory Data Analysis: Data Visualization

.pull-left[
![](08-Confidence_Intervals_files/figure-html/unnamed-chunk-4-1.png)<!-- -->
]

.pull-right[
Most of the years are between 1980 and 2000, though there is a large spike in 2018.

- If this sample is **representative** of the population of US pennies, we would expect the distribution of *all* pennies' years to have a similar shape.
]

---
# Exploratory Data Analysis: Summary Statistics

Because we are interested in the **mean** year of *all* US pennies in 2019, let's calculate the **sample mean**, `\(\bar{x}\)`, of our 50 pennies using `mean()`:

```r
x_bar = mean(pennies_sample$year)
x_bar
```

```
[1] 1995.44
```

Therefore, our **point estimate** is `\(\bar{x}=1995.44\)`. This represents our *best guess* at the **population mean** year of all US pennies, `\(\mu\)`.

---
# Sampling Variability

**Why are we interested in this?**

*Every* **sample statistic** has some variability.

- Suppose you take a random sample of 50 Brown University students and five are left-handed.

--

If you take a *different* random sample of 50 Brown University students, how many would you *expect* to be left-handed?

- Suppose three are left-handed. Is that surprising?
- Would 40 left-handed students out of 50 be surprising?

**Two ways to measure variability**:

1. Theory (Central Limit Theorem, etc.) -- AP/Introductory Statistics
2. **Simulation** (e.g., *bootstrapping*) -- **THIS CLASS**

---
# The Bootstrapping Process

Bootstrapping uses a process of sampling **with replacement** from our original sample to create new bootstrap samples of the **same size** as our original sample.

We can generate a *single bootstrap sample* by using the `rep_sample_n()` function from earlier:

```r
bootstrap_sample1 = rep_sample_n(pennies_sample, size = 50, reps = 1, replace = TRUE)
```

--

- Notice that `size=50`. This *isn't an arbitrary number*. When bootstrapping, the `size` value will *always be the same as the original sample size*!
- We add a new argument to `rep_sample_n()`, `replace = TRUE`. This means that when a penny is selected for our **bootstrap sample**, it has the chance to be selected *again*.

---
# Bootstrapping

When using the bootstrap, it might help to think of our original sample *as if* it were the population.

- If the sample is *representative*, then the population might as well just be tons of copies of the original sample.

--

**Example**: Meet some "data":

.center[
<img src="ac_beau.jpeg" width="80" /><img src="ac_diva.jpeg" width="80" /><img src="ac_rod.jpeg" width="79" /><img src="ac_pango.jpeg" width="81" /><img src="ac_goose.jpeg" width="80" /><img src="ac_dora.jpeg" width="79" />
]

---
# How Bootstrapping Works

**One Sample** `\(\implies\)` *One Sample Statistic*

.center[
<img src="bootstrap1.png" width="280" />
]

---
# How Bootstrapping Works

**One Sample** `\(\implies\)` **Bootstrap Sample** `\(\implies\)` *Bootstrap Statistic*

.center[
<img src="bootstrap2.png" width="548" />
]

---
# How Bootstrapping Works

**One Sample** `\(\implies\)` **Bootstrap Samples** `\(\implies\)` *Bootstrap Statistics*

.center[
<img src="bootstrap3.png" width="540" />
]

---
# How Bootstrapping Works

**One Sample** `\(\implies\)` **Many Bootstrap Samples** `\(\implies\)` *Many Bootstrap Statistics*

.center[
<img src="bootstrap4.png" width="541" />
]

---
# Why Bootstrapping Works

If the sample is **representative**, the *population* might as well be *many copies of the sample*.

.center[
<img src="bootstrap5.png" width="502" />
]

---
# Why Bootstrapping Works

If the sample is **representative**, the *population* might as well be *many copies of the sample*.

.center[
<img src="bootstrap6.png" width="616" />
]

---
class: center, middle, frame

# The `infer` Package for Statistical Inference

---
# `infer`

The `infer` package provides a useful resource in building **confidence intervals** and conducting **hypothesis tests** (more on those later).
There are several *verb-named functions* that are chained together in order.

.center[
<img src="infer_hex.png" width="30%" />
]

---
# Step 1: `specify()`

The `specify()` function is used primarily to choose which variables will be the focus of the statistical inference.

- This is done using the `response = ` argument:

```r
pennies_sample %>% 
  specify(response = year)
```

```
Response: year (numeric)
# A tibble: 50 × 1
    year
   <dbl>
 1  2002
 2  1986
 3  2017
 4  1988
 5  2008
 6  1983
 7  2008
 8  1996
 9  2004
10  2000
# … with 40 more rows
```

---
# Step 2: `generate()` replicates

After `specify()`-ing the main variable of interest, we use `generate()` to generate **bootstrap samples** (i.e., *replicates*) from the original sample.

- Here, we can easily create 1,000 bootstrap samples:

```r
pennies_sample %>% 
  specify(response = year) %>% 
  generate(reps = 1000, type = "bootstrap")
```

```
## Response: year (numeric)
## # A tibble: 50,000 × 2
## # Groups:   replicate [1,000]
##    replicate  year
##        <int> <dbl>
##  1         1  2015
##  2         1  1971
##  3         1  1993
##  4         1  1990
##  5         1  2008
##  6         1  1983
##  7         1  1996
##  8         1  1981
##  9         1  2017
## 10         1  2015
## # … with 49,990 more rows
```

If you view this dataset, you will see that there are 50,000 rows!

- We took 1,000 bootstrap samples, each of size 50.

---
# Step 3: `calculate()` summary statistics

Once we have 1,000 **bootstrap samples**, we need to calculate a **summary statistic** for each sample.

- In this example, the *summary statistic* is the **mean**.

```r
pennies_sample %>% 
  specify(response = year) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean")
```

--

This generates a data frame with 1,000 rows, each containing the sample mean of its respective **bootstrap sample**.

- This set of 1,000 sample means represents a **bootstrap distribution**.
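
What these three verbs do can also be sketched without `infer`. The following is a minimal base-R illustration, not the package's implementation: `years` holds only the first ten years printed in the `specify()` output (the full sample has 50 pennies), and the seed and the names `years`, `one_resample`, and `boot_means` are assumptions made for this sketch.

```r
# First ten years printed in the output above (the full sample has 50 rows)
years <- c(2002, 1986, 2017, 1988, 2008, 1983, 2008, 1996, 2004, 2000)

set.seed(907)  # arbitrary seed, only so that the resamples are reproducible

# One bootstrap sample: same size as the original, drawn WITH replacement
one_resample <- sample(years, size = length(years), replace = TRUE)

# 1,000 bootstrap samples, keeping only the mean of each one
boot_means <- replicate(
  1000,
  mean(sample(years, size = length(years), replace = TRUE))
)

length(boot_means)  # one bootstrap mean per replicate
mean(years)         # the original sample mean, for comparison
```

With the course packages loaded, `pennies_sample$year` could replace `years` to resample all 50 pennies.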
---
# Step 4: `visualize()` the results

.center[
<img src="infer.png" width="515" />
]

---
# Step 4: `visualize()` the results

```r
pennies_sample %>% 
  specify(response = year) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean") %>% 
  visualize()
```

<img src="08-Confidence_Intervals_files/figure-html/unnamed-chunk-20-1.png" width="40%" />

---
class: center, middle, frame

# Confidence Intervals

---
# Confidence Intervals

A **confidence interval** gives a range of *plausible values* for a population parameter.

- Using *only* a **sample statistic** to estimate a parameter is like fishing in a murky lake with a spear, and using a confidence interval is like fishing with a net.

.center[
<img src="fishing.png" width="50%" />
]

---
# Confidence Intervals

Confidence intervals depend on a specified **confidence level**, with...

- *higher* confidence levels corresponding to *wider* confidence intervals, and
- *lower* confidence levels corresponding to *narrower* confidence intervals.

Common **confidence levels** include 90%, 95%, and 99%.

--

Using the **bootstrap distribution** (i.e., the 1,000 bootstrap sample means from earlier), we can use the **percentile method** to obtain a confidence interval for the population mean year on US pennies.

---
# The Percentile Method

The only thing you need to do here is to use the `get_ci()` function from the `infer` package:

```r
percentile_ci = pennies_sample %>% 
  specify(response = year) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean") %>% 
  # 0.95 and "percentile" are the default values
  get_ci(level = 0.95, type = "percentile")
percentile_ci
```

```
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    1991.    1999.
```

--

The **percentile method** gives us the 2.5th and the 97.5th *percentiles* of the bootstrap distribution.

- Our range of plausible values for the mean year on US pennies in 2019 is between roughly 1991 and 2000, with **95% confidence**. (The exact endpoints vary slightly from run to run because the bootstrap samples are random.)
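
The percentile method amounts to reading two quantiles off the bootstrap distribution. Here is a minimal base-R sketch of that idea, under stated assumptions: `years` is only the first ten years printed from `pennies_sample`, and the seed and the names `boot_means` and `percentile_ci_by_hand` are made up for illustration. It is meant to show the idea behind `get_ci()`, not to reproduce its output.

```r
# First ten years printed from pennies_sample (the full sample has 50 rows)
years <- c(2002, 1986, 2017, 1988, 2008, 1983, 2008, 1996, 2004, 2000)

set.seed(907)  # arbitrary seed for reproducibility

# Bootstrap distribution: 1,000 means of resamples drawn with replacement
boot_means <- replicate(1000, mean(sample(years, replace = TRUE)))

level <- 0.95
alpha <- 1 - level

# Middle 95% of the bootstrap means: the 2.5th and 97.5th percentiles
percentile_ci_by_hand <- quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2))
percentile_ci_by_hand
```

Changing `level` moves the two cut points symmetrically, which is why the same code covers 90% or 99% intervals as well.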
---
# Visualizing the CI

A cool thing you can do in R is to use the `visualize()` function to plot the *confidence interval* on top of the bootstrap distribution histogram.

- Run the following:

```r
pennies_sample %>% 
  specify(response = year) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean") %>% 
  visualize() + 
  shade_ci(endpoints = percentile_ci, color = "hotpink", fill = "chartreuse")
```

---
class: center, middle, frame

# Interpreting the Confidence Interval

---
# Calculating Many CIs

--

.center[
<img src="ci_sim.png" width="60%" />
]

---
# Calculating Many CIs

Using 95% as our **confidence level**, 94 of the 100 CIs contained the population mean `\(\mu=1995.133\)`, while 6 did not.

**What does this mean?**

- The procedure we have used to generate confidence intervals is "*95% reliable*" in that we can expect it to include the true population parameter **approximately** 95% of the time *if the process is repeated*.

---
# Back to our example...

**What is a precise interpretation of a confidence interval?**

Recall our *original 95% confidence interval* using the **percentile method**: `\([1991, 2000]\)`.

**Interpretation**: We are **95% confident** that the average year on US pennies in 2019 is between 1991 and 2000, using the **percentile method**.

---
# Interpreting a CI

In general...

**Precise interpretation**: If we repeated our sampling procedure a large number of times, we expect about 95% of the resulting confidence intervals to capture the value of the population parameter.

**Short-hand interpretation**: We are 95% “confident” that a 95% confidence interval captures the value of the population parameter.

---
# Width of Confidence Interval

**Confidence level**

In order to be more confident in our best guess of a range of values, we need to widen the range of values.

- The higher the *confidence level*, the wider a confidence interval will be.
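
When several percentile intervals are read off the *same* bootstrap distribution, the widening with confidence level can be checked directly. This is a minimal base-R sketch under assumptions: `years` is only the first ten years printed from `pennies_sample`, and the seed and the helper `ci_width()` are hypothetical names introduced for illustration.

```r
# First ten years printed from pennies_sample (the full sample has 50 rows)
years <- c(2002, 1986, 2017, 1988, 2008, 1983, 2008, 1996, 2004, 2000)

set.seed(907)  # arbitrary seed for reproducibility
boot_means <- replicate(1000, mean(sample(years, replace = TRUE)))

# Hypothetical helper: width of the percentile interval at a given level
ci_width <- function(x, level) {
  alpha <- 1 - level
  endpoints <- quantile(x, probs = c(alpha / 2, 1 - alpha / 2))
  unname(endpoints[2] - endpoints[1])
}

# Widths at 80%, 95%, and 99%, all from one bootstrap distribution
widths <- sapply(c(0.80, 0.95, 0.99), function(level) ci_width(boot_means, level))
names(widths) <- c("80%", "95%", "99%")
widths
```

Because a higher level cuts further into both tails of the same set of bootstrap means, the widths here cannot shrink as the level goes up.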
--

```r
pennies_sample %>% 
  specify(response = year) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean") %>% 
  get_ci(level = 0.80, type = "percentile")
```

```
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    1993.    1998.
```

---
# Width of Confidence Interval

**Confidence level**

In order to be more confident in our best guess of a range of values, we need to widen the range of values.

- The higher the *confidence level*, the wider a confidence interval will be.

```r
pennies_sample %>% 
  specify(response = year) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean") %>% 
  get_ci(level = 0.95, type = "percentile")
```

```
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    1991.    2000.
```

---
# Width of Confidence Interval

**Confidence level**

In order to be more confident in our best guess of a range of values, we need to widen the range of values.

- The higher the *confidence level*, the wider a confidence interval will be.

```r
pennies_sample %>% 
  specify(response = year) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean") %>% 
  get_ci(level = 0.99, type = "percentile")
```

```
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    1990.    2000.
```

---
# Impact of Sample Size

In general, *larger sample sizes tend to produce narrower confidence intervals*.

- As our sample size increases, our estimate gets more *precise*.
- Also, the **standard error decreases**.

For example, a 95% confidence interval with `\(n=50\)` will typically be *narrower* than a 95% confidence interval with `\(n=25\)` drawn from the same population.
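
The claim about the standard error can be made concrete with a theory-based approximation, which is an addition here and not part of the bootstrap approach used in this class. For a sample mean, `\(SE \approx s/\sqrt{n}\)`, where `\(s\)` is the sample standard deviation. Assuming `\(s\)` is about the same in both samples:

- `\(SE_{50}/SE_{25} \approx \sqrt{25/50} = 1/\sqrt{2} \approx 0.71\)`
- So doubling the sample size from 25 to 50 would shrink the interval width by roughly 29%, not by half.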