CEMA 0907: Statistics in the Real World

class: center, middle, inverse, title-slide

# CEMA 0907: Statistics in the Real World
## Theory-Based Tests
### Anthony Scotina

---

# Needed Packages

```r
library(tidyverse)
library(infer)
library(nycflights13)
```

---

# Theory vs Simulation

So far we have focused on technology and simulation-based methods for carrying out and understanding statistical inference and data analysis.

- **Bootstrap resampling** for approximating a *sampling distribution* and constructing **confidence intervals**

- **Permutation tests** for constructing a *null distribution* and performing hypothesis tests for *any* parameter.

However, more traditional *theory-based methods* have been around for decades, mostly because we didn't have the technology to perform any of the simulation-based methods covered this semester.

- In fact, R was first created in 1993 and was not widely used until the 2000s!

- Theory-based methods are still used in a variety of fields and situations.

---

# Some Stats History

.pull-left[
William Sealy Gosset worked as a chemist for the Guinness brewery in Ireland.

- Gosset implemented the **t-test** as a way to monitor the quality of stout beer.

- Gosset published this work in *Biometrika* (a top statistical journal) in 1908, but Guinness had a policy that forbade its chemists from publishing their findings while working for the company.

- The pseudonym *Student* was used instead.
]

.pull-right[
<img src="gosset.jpg" width="75%" />
]

---

# Two-sample t-test

Ultimately, any theory-based hypothesis test is an *approximation* to the simulation-based hypothesis test procedures we've used so far.

- Theory-based tests typically have several *assumptions* that we have to verify in order to use them.

We'll focus on the **two-sample t-test** for the *difference between sample means*.

- However, the *test statistic* we'll use isn't `$\bar{x}_{1}-\bar{x}_{2}$`; rather, the two-sample t-test uses the **t-statistic**.

---

# Two-sample t-statistic

Recall that we use *z*-scores in order to **standardize** a variable: `$$z=\frac{x-\mu}{\sigma}$$`

This allows us to compare observations from different distributions.

- **Examples**: 
    - Comparing ACT and SAT scores 
    - Comparing weather in different states (Is today's temperature in Boston warmer than the temperature in Orlando, relative to the respective distributions?)

The *z*-score **re-centers** a unimodal/symmetric (Normal) variable around 0 and **re-scales** it to have a standard deviation of 1.
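
As a minimal sketch of the test-score comparison, the block below standardizes one SAT score and one ACT score. The means, standard deviations, and the two scores are hypothetical values chosen only for illustration; they are not official figures.

```r
# Hypothetical population parameters (assumptions for illustration only)
sat_mean <- 1050; sat_sd <- 200
act_mean <- 21;   act_sd <- 5

# z-score: how many standard deviations x lies from the mean
z_score <- function(x, mu, sigma) (x - mu) / sigma

z_sat <- z_score(1250, sat_mean, sat_sd)  # one SD above the SAT mean
z_act <- z_score(27, act_mean, act_sd)    # 1.2 SDs above the ACT mean

c(SAT = z_sat, ACT = z_act)
```

Under these assumed parameters, the ACT score is the more impressive one relative to its own distribution, even though the raw numbers are not directly comparable.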

---

# Standard Normal Distribution

The distribution of *z*-scores for *every* observation in a Normal variable is the **standard Normal distribution** (or a *z-distribution*).

.center[
<img src="standard_normal.png" width="1792" />
]
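
A quick simulation can make this concrete. The sketch below draws from a Normal variable with an assumed mean of 70 and standard deviation of 10 (arbitrary choices, as is the seed), standardizes every observation, and checks that the resulting *z*-scores have mean near 0 and standard deviation near 1.

```r
set.seed(907)  # arbitrary seed so the draw is reproducible

mu    <- 70    # assumed population mean
sigma <- 10    # assumed population standard deviation

x <- rnorm(10000, mean = mu, sd = sigma)  # simulated Normal variable
z <- (x - mu) / sigma                     # z-score for every observation

# Both values should be close to 0 and 1, up to simulation noise
c(mean_z = mean(z), sd_z = sd(z))
```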

---

# Back to Movie Ratings

For the `movies_sample` dataset, how would we **standardize** `$\bar{x}_{r}-\bar{x}_{a}$`?

- Recall that this is the difference in *average* movie ratings between *romance* and *action* movies.

**Recall (Central Limit Theorem)**:

- Assuming the sampling is *representative*, then the *sampling distribution* for `$\bar{x}_{r}-\bar{x}_{a}$` will be centered at the (unknown) `$\mu_{r}-\mu_{a}$`.

- The *standard deviation of the difference in sample means* is the **standard error**, `$SE_{\bar{x}_{r}-\bar{x}_{a}}$`.

Putting this all together, the **two-sample t-statistic** is: `$$t=\frac{(\bar{x}_{r}-\bar{x}_{a})-(\mu_{r}-\mu_{a})}{SE_{\bar{x}_{r}-\bar{x}_{a}}}=\frac{(\bar{x}_{r}-\bar{x}_{a})-(\mu_{r}-\mu_{a})}{\sqrt{\frac{s_{r}^{2}}{n_{r}}+\frac{s_{a}^{2}}{n_{a}}}}$$`

---

# Using the t-statistic

In the *t-test*, we use the *t-statistic* similarly to how we used the **observed difference** in the simulation-based permutation tests.

`$$t=\frac{(\bar{x}_{r}-\bar{x}_{a})-(\mu_{r}-\mu_{a})}{SE_{\bar{x}_{r}-\bar{x}_{a}}}=\frac{(\bar{x}_{r}-\bar{x}_{a})-(\mu_{r}-\mu_{a})}{\sqrt{\frac{s_{r}^{2}}{n_{r}}+\frac{s_{a}^{2}}{n_{a}}}}$$`

- Assuming `$H_{0}:\mu_{r}-\mu_{a}=0$` as we did before, the `$(\mu_{r}-\mu_{a})$` term in the numerator of the above formula becomes 0.

---

# The t-distribution

Similarly to how we use the CLT to say that `$\bar{X}\sim N(\mu, \sigma/\sqrt{n})$` for large `$n$`, it can be *mathematically proven* that the *t*-statistic follows a **t-distribution** with **degrees of freedom** roughly equal to `$df=n_{r}+n_{a}-2$`.

- For larger sample sizes, the *df* increases and the *t*-distribution converges to the *standard Normal*.

.center[
<img src="t_distributions.png" width="75%" />
]
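
One way to see the convergence numerically is to compare cutoffs. This sketch uses base R's `qt()` and `qnorm()` to find the 97.5th percentile of several *t*-distributions and of the standard Normal; the particular *df* values are arbitrary choices for illustration.

```r
dfs <- c(2, 5, 30, 198)          # illustrative degrees of freedom

t_crit <- qt(0.975, df = dfs)    # 97.5th percentile of each t-distribution
z_crit <- qnorm(0.975)           # 97.5th percentile of the standard Normal

# The gap between the t and z cutoffs shrinks as df grows
data.frame(df = dfs,
           t_crit = round(t_crit, 3),
           gap_from_z = round(t_crit - z_crit, 3))
```

The heavier tails of the *t*-distribution push its cutoff above the Normal cutoff, and that difference fades as *df* increases.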

---

# The t-statistic

```r
movies_sample %>%
  group_by(genre) %>%
  summarize(mean_rating = mean(rating), 
            sd_rating = sd(rating), 
            n_genre = n())
```

```
## # A tibble: 2 × 4
##   genre   mean_rating sd_rating n_genre
##   <chr>         <dbl>     <dbl>   <int>
## 1 Action         5.04      1.59     100
## 2 Romance        6.19      1.31     100
```

Plugging these into the *t*-statistic formula, `$$t=\frac{(\bar{x}_{r}-\bar{x}_{a})-(0)}{\sqrt{\frac{s_{r}^{2}}{n_{r}}+\frac{s_{a}^{2}}{n_{a}}}}=\frac{(6.19-5.04)}{\sqrt{\frac{1.31^2}{100}+\frac{1.59^2}{100}}}=5.58$$`
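
The same arithmetic can be checked in R. This is a small sketch that types in the rounded summary statistics from the table above; the variable names are made up for illustration.

```r
# Rounded summary statistics from the table (Romance = r, Action = a)
x_bar_r <- 6.19; s_r <- 1.31; n_r <- 100
x_bar_a <- 5.04; s_a <- 1.59; n_a <- 100

# Standard error of the difference in sample means
se <- sqrt(s_r^2 / n_r + s_a^2 / n_a)

# t-statistic under H0: mu_r - mu_a = 0
t_stat <- ((x_bar_r - x_bar_a) - 0) / se

round(t_stat, 2)
```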

---

# The t-statistic

`$$t=\frac{(\bar{x}_{r}-\bar{x}_{a})-(0)}{\sqrt{\frac{s_{r}^{2}}{n_{r}}+\frac{s_{a}^{2}}{n_{a}}}}=\frac{(6.19-5.04)}{\sqrt{\frac{1.31^2}{100}+\frac{1.59^2}{100}}}=5.58$$`

We used the observed *sample difference in means* to calculate the **test statistic**, which is a *t-statistic* in a two-sample t-test.

As we have done with permutation tests, next we have to use the test statistic to calculate the **p-value**.

- How? Compare *t* to a **null distribution**.

---

# Null Distribution: Simulation vs Theory

First, recall the *null distribution* for the permutation test of `$\bar{x}_{r}-\bar{x}_{a}$` from earlier in the module:

```r
null_distribution_rating = movies_sample %>%
  specify(response = rating, explanatory = genre) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("Romance", "Action"))
visualize(null_distribution_rating)
```

We can also *simulate* a null distribution for theory-based test statistics, such as the *t*-statistic:

```r
null_distribution_rating_t = movies_sample %>%
  specify(response = rating, explanatory = genre) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "t", order = c("Romance", "Action"))
visualize(null_distribution_rating_t)
```

---

# Null Distribution: Simulation vs Theory

.pull-left[
![](10-Theory_Testing_files/figure-html/unnamed-chunk-10-1.png)
]

.pull-right[
![](10-Theory_Testing_files/figure-html/unnamed-chunk-11-1.png)
]

---

# Theory-Based Null Distribution

However, because theory-based methods were used before computers were around, the traditional theory-based *t*-test doesn't look at a *simulated histogram* of the null distribution.

- Instead, it looks at the *t*-distribution curve with `$df=n_{r}+n_{a}-2=198$`:

```r
visualize(null_distribution_rating_t, method = "both", 
          dens_color = "red", fill = "gray") + 
  theme_bw()
```

```
## Warning: Check to make sure the conditions have been met for the theoretical
## method. {infer} currently does not check these for you.
```

---

# Theory-Based Null Distribution

```r
visualize(null_distribution_rating_t, method = "both", 
          dens_color = "red", fill = "gray") + 
  theme_bw()
```

It seems that the theoretical curve does a good job of *approximating* the histogram.

- To calculate the p-value, we need the total area under the *t-distribution curve* that is equal to or "more extreme" than the observed *t*-statistic (5.58).

---

# Theory-Based p-value

```r
null_distribution_rating_t %>%
  get_p_value(obs_stat = 5.58, direction = "both")
```

```
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
```

```
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0
```

OR, using a non-simulation-based approach:

```r
t.test(rating ~ genre, data = movies_sample)
```
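
The tail area under the *t*-distribution curve can also be computed by hand with `pt()`. The sketch below uses the observed *t*-statistic (5.58) and the simple `$df=198$` from earlier. It also computes the Welch-Satterthwaite degrees of freedom from the rounded summary statistics, since `t.test()` runs the Welch version by default (`var.equal = FALSE`); the variable names are illustrative.

```r
t_obs <- 5.58                 # observed t-statistic from the slides
df_simple <- 100 + 100 - 2    # n_r + n_a - 2

# Two-sided p-value: area in both tails beyond |t_obs|
p_value <- 2 * pt(abs(t_obs), df = df_simple, lower.tail = FALSE)
p_value

# Welch-Satterthwaite df from the rounded summary statistics
v_r <- 1.31^2 / 100           # s_r^2 / n_r
v_a <- 1.59^2 / 100           # s_a^2 / n_a
df_welch <- (v_r + v_a)^2 / (v_r^2 / (100 - 1) + v_a^2 / (100 - 1))
df_welch
```

With these rounded inputs the Welch *df* comes out to about 191, slightly below 198, and the p-value is tiny either way. Note that `t.test(rating ~ genre, ...)` orders the groups alphabetically (Action minus Romance), so its *t*-statistic has the opposite sign; the two-sided p-value is unaffected.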

---

# Conditions for two-sample t-test

Let's go back to a warning message we saw when using `visualize()` with `method = "both"` to overlay the theoretical *t*-distribution curve:

> `Check to make sure the conditions have been met for the theoretical method. {infer} currently does not check these for you.`

To be able to use *t*-tests and other theory-based statistical procedures there are a few **conditions** to check.

1. Nearly Normal populations *or* large sample sizes (usually `$n_{1}>30$` and `$n_{2}>30$`)

2. Both *samples* are selected independently *of each other*.

3. All *observations* are independent of each other.

Only **Condition 3** is questionable in the movie ratings example. For example, if *the same person rated multiple movies*, then the observations are not independent.
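
The sample-size part of Condition 1 is easy to check in code. The helper below, `check_sample_sizes()`, is a hypothetical function written for this sketch (it is not part of `infer` or base R), and the threshold of 30 is the rule of thumb from the list above. It says nothing about Conditions 2 and 3, which depend on how the data were collected.

```r
# Hypothetical helper: counts observations per group and flags
# whether each group exceeds the rule-of-thumb threshold
check_sample_sizes <- function(groups, threshold = 30) {
  counts <- table(groups)
  data.frame(
    group = names(counts),
    n = as.integer(counts),
    large_enough = as.integer(counts) > threshold
  )
}

# Example with made-up group labels: 100 per group passes, 12 per group does not
big   <- check_sample_sizes(c(rep("Action", 100), rep("Romance", 100)))
small <- check_sample_sizes(c(rep("A", 12), rep("B", 12)))
big
small
```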

---

# Practice

**Use a two-sample t-test** to determine whether there is a *significant difference* in **average** departure delays (`dep_delay`) between **United Airlines** and **JetBlue Airways** (`carrier`).

First, let's filter the `flights` dataset for the two airlines. Use `flights_ex` for the rest of this exercise.

```r
flights_ex = flights %>%
  na.omit() %>% # drops rows with a missing value in any column
  filter(carrier %in% c("UA", "B6"))
```

---

# What happens if a condition isn't met?

If any of the conditions for a theory-based hypothesis test are not met, then we can't put much faith in the conclusions reached using these tests.

- In these cases, we could instead use a **permutation test**, since it doesn't rely on Normal populations or large sample sizes!

Run the following to load a dataset consisting of waiting times of 24 automobile oil changes, 12 from store "A" and 12 from store "B".

```r
oil = data.frame(
  times = c(104, 102, 35, 911, 56, 325, 7, 9, 179, 59, 4, 19,
            25, 45, 10, 31, 187, 342, 531, 21, 28, 2104, 2, 12), 
  store = c(rep("A", 12), rep("B", 12))
  )
```

---

# What happens if a condition isn't met?

Clearly, the sample sizes in the `oil` dataset are small, and each separate distribution of waiting times is **right-skewed**.

```r
ggplot(data = oil, aes(x = times, fill = store)) + 
  geom_density()
```

- Therefore it might be better to look at a hypothesis test for **difference between medians**.

- But we can't do that with a *t*-test! Those are only for **means**!
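
A permutation test handles medians without trouble. The block below is a minimal base-R sketch, repeating the `oil` data so it runs on its own; the seed, the 1000 reps, and the helper name `diff_in_medians()` are assumptions made for illustration.

```r
oil <- data.frame(
  times = c(104, 102, 35, 911, 56, 325, 7, 9, 179, 59, 4, 19,
            25, 45, 10, 31, 187, 342, 531, 21, 28, 2104, 2, 12),
  store = c(rep("A", 12), rep("B", 12))
)

set.seed(907)  # arbitrary seed so the shuffles are reproducible

# Difference in medians, store A minus store B
diff_in_medians <- function(times, store) {
  median(times[store == "A"]) - median(times[store == "B"])
}

obs_diff <- diff_in_medians(oil$times, oil$store)

# Null distribution: shuffle the store labels and recompute the statistic
null_diffs <- replicate(1000, diff_in_medians(oil$times, sample(oil$store)))

# Two-sided p-value: share of shuffles at least as extreme as observed
p_value <- mean(abs(null_diffs) >= abs(obs_diff))

obs_diff
p_value
```

The `infer` pipeline used elsewhere in these slides should give the same kind of result with `calculate(stat = "diff in medians", order = c("A", "B"))`.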

---

# Why doesn't the t-test work here?

```r
oil %>%
  specify(response = times, explanatory = store) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000) %>%
  calculate(stat = "t", order = c("A", "B")) %>%
  visualize(method = "both")
```

```
## Setting `type = "permute"` in `generate()`.
```

```
## Warning: Check to make sure the conditions have been met for the theoretical
## method. {infer} currently does not check these for you.
```