CEMA 0907: Statistics in the Real World

# CEMA 0907: Statistics in the Real World
## Hypothesis Testing
### Anthony Scotina

---

# Needed Packages

```r
library(tidyverse)
library(infer)
library(nycflights13)
```

---

# When inference is not needed

---

# Back to the `flights` data frame

Sometimes, it isn't necessary to perform a statistical hypothesis test. 
- *Always* do **exploratory data analysis**!

```r
View(flights)
```

---

# Average Flight Times

Assuming two flights leave from New York, which do you think is longer?
- The flight to **Boston**?
- The flight to **San Francisco**?

```r
bos_sfo = flights %>%
  na.omit() %>% # removes flights with missing info
  filter(dest %in% c("BOS", "SFO")) 
```

---

# Exploratory Data Analysis: Summary Statistics

```r
bos_sfo %>%
  group_by(dest) %>%
  summarize(mean = mean(air_time), sd = sd(air_time))
```

```
# A tibble: 2 × 3
  dest   mean    sd
  <chr> <dbl> <dbl>
1 BOS    39.0  4.95
2 SFO   346.  17.2 
```

---

# Exploratory Data Analysis: Data Visualization

```r
ggplot(data = bos_sfo, mapping = aes(x = dest, y = air_time)) + 
  geom_boxplot()
```

---

# Summary

There is *no overlap at all* in the boxplots!
- This means that the air time for San Francisco flights is **statistically greater** than the air time for Boston flights (which isn't surprising).

The procedures that follow *would not be necessary* for data such as these. 
- Always do **exploratory data analysis**!!!

---

# Basics of Hypothesis Testing

---

# Hypothesis Tests

In a statistical **hypothesis test**, we use *sample data* to help us decide between two competing hypotheses about a *population parameter*. 
- The **null hypothesis** 
    - `\(H_{0}\)`
- The **alternative hypothesis**
    - `\(H_{a}\)`

The **alternative hypothesis** is the claim for which we seek significant evidence. 
- **Example**: Are the flight times for flights from NY to BOS or SFO *significantly different*?

The **null hypothesis** is usually a claim of *no effect* or *no difference*. 
- **Example**: There is *no difference* in the flight times from NY to BOS or SFO.

---

# Assessing Strength of Evidence

First, we *assume that the null hypothesis is true*. 
- Then, we determine *how unlikely* it would be to observe **sample statistics** as extreme (or more extreme) than the ones in the original sample.

**Don't worry!** This might not sound immediately straightforward. 
- Seasoned scientists often misunderstand statistical hypothesis tests.

**Criminal trial analogy**: *Innocent until proven guilty*

---

# Two Possible Conclusions

These are the *only two* possible conclusions to a statistical hypothesis test:

- **Reject** the null hypothesis `\(H_{0}\)` `\(\rightarrow\)` **Accept** the alternative hypothesis `\(H_{a}\)` 
- **Fail to reject** the null hypothesis `\(H_{0}\)`

Therefore, *never* write "Accept `\(H_{0}\)`" when you mean "Fail to reject `\(H_{0}\)`!"

---

# Statistical Significance

The results from a hypothesis test are **statistically significant** if they are *more extreme* than what we would expect to see by random chance, **if the null hypothesis were true**. 
- In other words, we have *convincing evidence* in favor of the alternative hypothesis, allowing us to generalize our *sample* results to the claim about the *population*.

**Example**: Back to Sarah the chimp...

---

# Sarah the chimp

.pull-left[
- In 1978, researchers Premack and Woodruff published a study in *Science* magazine, reporting an experiment where an adult chimpanzee named Sarah was shown videotapes of eight different scenarios of a human being faced with a problem.

- After each videotape showing, she was presented with two photographs, one of which depicted a possible solution to the problem.

- Sarah could pick the photograph with the correct solution for seven of the eight problems!
]

---

# How?!

What are **two possible explanations** for Sarah getting 7 correct answers out of 8?

1. Sarah was just guessing and got lucky.

2. Sarah can do better than just guessing.

Explanation 1 is the **null hypothesis**. Is there enough evidence to reject the explanation that Sarah was *guessing*?

---

# Simulating Guessing

If Sarah were just guessing, we would *expect* the number of correct guesses to be 4.

- However, not every set of 8 coin tosses will result in 4 heads.

- Let's repeat the set of 8 coin tosses many times, to generate the pattern for correct answers that could happen in the long run, **under the assumption that Sarah is just guessing**.

--
.center[
<img src="09-Hypothesis_Testing_files/figure-html/unnamed-chunk-8-1.png" width="45%" />
]

---

# Hypothesis Testing in R

---

# `infer`

---

# `hypothesize()`

You have seen most of these `infer` package functions when we constructed **confidence intervals**. 
- Confidence intervals are *closely related* to hypothesis tests!

The only new verb-named fucntion here is `hypothesize()`:
- Its main argument is `null`, which is either:
    - `point` for hypotheses involving a *single sample*
    - `independence` for hypotheses involving a *comparison between groups*
    
---

# Comparing Two Means

First we will build hypotheses which look at the **difference between two population means**.

Recall that `\(\mu\)` ("mu") denotes a single population mean. 
- When we have *two groups*... 
    - Let `\(\mu_{1}\)` be the population mean of group 1. 
    - Let `\(\mu_{2}\)` be the population mean of group 2. 
    
--

**Two competing hypotheses**

`\(H_{0}\)`: `\(\mu_{1}=\mu_{2}\)` (The population means are equivalent between groups.)

`\(H_{a}\)`: `\(\mu_{1}\neq \mu_{2}\)` (or `\(<\)`, `\(>\)` depending on the context)

---

# Action vs. Romance Movies

The `movies_sample` dataset contains a sample of 200 *action* and *romance* movies, along with their title, year of release, and IMDB.com rating.

```r
movies_sample
```

```
# A tibble: 200 × 4
   title                        year rating genre 
   <chr>                       <dbl>  <dbl> <chr> 
 1 Sinnui yauman III: Do Do Do  1991    7   Action
 2 Thunderpoint                 1998    7   Action
 3 Navajo Blues                 1996    3   Action
 4 Steel and Lace               1991    5.4 Action
 5 Joi gin a long               1999    6.6 Action
 6 Collectors, The              1999    3.8 Action
 7 Lawless Empire               1945    5.1 Action
 8 City on Fire                 1979    3.4 Action
 9 Lost at Sea                  1995    5.8 Action
10 Last of the Finest, The      1990    5.2 Action
# … with 190 more rows
```

---

# Action vs. Romance Movies

**Question**: Is there a *significant difference* in the **average movie ratings** of Action vs. Romance movies on IMDB.com?

Let `\(\mu_{a}\)` denote the *population mean* rating of Action movies, and let `\(\mu_{r}\)` denote the *population mean* rating of Romance movies. 
- `\(H_{0}\)`: `\(\mu_{a}=\mu_{r}\)`
- `\(H_{a}\)`: `\(\mu_{a}\neq \mu_{r}\)`

---

# Exploratory Data Analysis: Data Visualization

```r
ggplot(data = movies_sample, mapping = aes(x = genre, y = rating)) + 
  geom_boxplot()
```

---

# Exploratory Data Analysis: Summary Statistics

Unlike the comparison in flight times between BOS and SFO destinations, it is unclear if the *average* movie rating is **significantly different** between Action and Romance movies.

Let's calculate **summary statistics** *by group* to gather more detail:

```r
summary_stats = movies_sample %>%
  group_by(genre) %>%
  summarize(mean = mean(rating), sd = sd(rating))
summary_stats
```

```
# A tibble: 2 × 3
  genre    mean    sd
  <chr>   <dbl> <dbl>
1 Action   5.04  1.59
2 Romance  6.19  1.31
```

---

# Exploratory Data Analysis: Summary Statistics

```
# A tibble: 2 × 3
  genre    mean    sd
  <chr>   <dbl> <dbl>
1 Action   5.04  1.59
2 Romance  6.19  1.31
```

Since we are interested in inference about the **difference between means**, our *sample statistic* will be the **difference between sample means**. 
- `\(\bar{x}_{r}=6.19\)` and `\(\bar{x}_{a}=5.04\)`, so `\(\bar{x}_{r}-\bar{x}_{a}=1.15\)`.

---

# Observed Effect

Another way of writing the sample statistic in hypothesis testing is to call it the **observed effect**, `\(\delta^{*}\)` ("delta star").

There is a more convenient way of calculating this in R, using the `infer` package:

```r
obs_diff = movies_sample %>% 
* specify(response = rating, explanatory = genre) %>%
  calculate(stat = "diff in means", order = c("Romance", "Action"))
obs_diff
```

```
Response: rating (numeric)
Explanatory: genre (factor)
# A tibble: 1 × 1
   stat
  <dbl>
1  1.15
```

- Recall the `specify()` and `calculate()` functions from earlier.

---

# Simulating the Null Hypothesis

Now that we calculated the **sample statistic** (or the *observed effect*), the next step in a statistical hypothesis test is to simulate a world where the **null hypothesis is true**. 
- `\(H_{0}\)`: `\(\mu_{a}=\mu_{r}\)`, or `\(\mu_{r}-\mu_{a}=0\)`

We can *simulate* this using `generate()`.

---

# Simulating the Null Hypothesis

If we *assume* that the population mean rating is **equal** between Action and Romance movies, then assuming `\(H_{0}\)` true is equivalent to taking all 200 ratings and *randomly re-shuffling* (**randomizing**) them into the Action and Romance groups. 
- If `\(H_{0}\)` is true, then there is no association between `genre` and `rating`, so each rating is *equally likely* to appear in either group.

This is the framework that `generate()` follows.

```r
movies_sample %>% 
  specify(response = rating, explanatory = genre) %>%
  hypothesize(null = "independence") %>% 
  generate(reps = 1) %>% 
  calculate(stat = "diff in means", order = c("Romance", "Action"))
```

```
Response: rating (numeric)
Explanatory: genre (factor)
Null Hypothesis: independence
# A tibble: 1 × 1
     stat
    <dbl>
1 -0.0310
```

---

# Simulating the Null Hypothesis

```r
movies_sample %>% 
* specify(response = rating, explanatory = genre) %>%
* hypothesize(null = "independence") %>%
  generate(reps = 1) %>% 
  calculate(stat = "diff in means", order = c("Romance", "Action"))
```

```
Response: rating (numeric)
Explanatory: genre (factor)
Null Hypothesis: independence
# A tibble: 1 × 1
     stat
    <dbl>
1 -0.0310
```

- `generate(reps = 1)` completes a single **permutation** of sending values of `ratings` to potentially different values of `genre` from which they originally came.

- Similar to **bootstrapping**, we can repeat this process 5,000 times, to simulate a world where the **null hypothesis** is true.

---

# Distribution Under the Null Hypothesis

```r
generated_samples = movies_sample %>% 
  specify(response = rating, explanatory = genre) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 5000) %>%
  calculate(stat = "diff in means", order = c("Romance", "Action"))
```

This creates a *distribution* of 5,000 **simulated** differences in sample means, called the **null distribution**.
- The **null distribution** is similar to the **bootstrap distribution** since we are just resampling from our original sample. 
- The difference here is that we are *assuming the null hypothesis is true*.

---

# The Null Distribution

```r
generated_samples %>%
  visualize()
```

---

# The Null Distribution

```r
generated_samples %>%
  visualize(obs_stat = 1.15)
```

---

# Quantifying the Strength of Evidence

We are interested in seeing if our *observed* sample difference in mean movie rating between Action and Romance of 1.15 is *greater than what would be expected from random chance*. 
- What are the chances of observing a sample average difference in means of 1.15, *in a world where the population averages between genres are equal*?

How can we **quantify** these chances?
- If these chances where *small*, we could **reject the null hypothesis**.

---

# The p-value

the **p-value** is the probability of observing a **sample statistic** as extreme (or more extreme) than the one in the original sample, *assuming the null hypothesis is true*.

```r
generated_samples %>%
  visualize(obs_stat = 1.15, direction = "both")
```

---

# The p-value

.pull-right[
Our *observed* **sample difference in means** was 1.15: On average, Romance movies had a IMDB.com rating that was 1.15 higher than Action movies. 
- We mark this with a *red line*, and shade red all values at or above that value...
    - and also shaded red those values at or below its negative value. 
]

---

# The p-value

.pull-right[
You can probably guess what the p-value will be for this example. 
- The red shading does not intersect the histogram at all!
]

---

# Calculating the p-value

```r
pvalue = generated_samples %>% 
  get_pvalue(obs_stat = 1.15, direction = "both")
pvalue
```

```
# A tibble: 1 × 1
  p_value
    <dbl>
1       0
```

Assuming a 5% **level of significance**, we have evidence supporting the conclusion that the mean rating for romance movies is different from that of action movies.
- p-value `\(<0.05\)`
- In other words, we **reject the null hypothesis** and **conclude the alternative hypothesis**.

---

# Corresponding Confidence Interval

Recall that we also used the `infer` pipeline to generate **95% confidence intervals** for a population parameter.
- It is straightforward to calculate a confidence interval for `\(\mu_{r}-\mu_{a}\)` using the existing `infer` pipeline.

Here is the pipeline we *can use* to obtain the **confidence interval**:

```r
movies_sample %>% 
  specify(response = rating, explanatory = genre) %>% 
  #hypothesize(null = "independence") %>% 
  generate(reps = 5000) %>%
  calculate(stat = "diff in means", order = c("Romance", "Action")) %>% 
  get_ci()
```

```
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.750     1.55
```

---

# Corresponding Confidence Interval

This is almost *identical* to the hypothesis test pipeline!
- Except we can just remove or comment out (`#`) the `hypothesize()` step since we are no longer assuming the null hypothesis is true when we bootstrap.

---

# CI Interpretation

```
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.750     1.55
```

- We are **95% confident** that Romance movies are rated between 0.739 and 1.55 points higher, *on average*, than Action movies on IMDB.com.

---

# A Note on `calculate()`

You can use `calculate()` for *any* hypothesis test; not just those comparing a **difference in means**.

```r
?calculate
```

---

# Hypothesis Test for One Mean

Instead of comparing movie rating between groups, let's compare the overall movie rating to 5:
- `\(H_{0}\)`: `\(\mu=5\)`
- `\(H_{a}\)`: `\(\mu\neq 5\)`

```r
movies_sample %>% 
* specify(response = rating) %>%
* hypothesize(null = "point", mu = 5) %>%
  generate(reps = 1) %>% 
  calculate(stat = "mean")
```

```
Response: rating (numeric)
Null Hypothesis: point
# A tibble: 1 × 1
   stat
  <dbl>
1  5.12
```

- `generate(reps = 1)` completes a single **bootstrap sample** of `ratings` from the *distribution* centered at 5.

---

# Hypothesis Test for One Mean

The **observed sample mean**:

```r
x_bar = movies_sample %>%
  specify(response = rating) %>%
  calculate(stat = "mean")
x_bar
```

```
Response: rating (numeric)
# A tibble: 1 × 1
   stat
  <dbl>
1  5.61
```

---

# Hypothesis Test for One Mean

```r
movies_sample %>% 
  specify(response = rating) %>% 
  hypothesize(null = "point", mu = 5) %>%  
  generate(reps = 5000) %>% 
  calculate(stat = "mean") %>%
  get_p_value(obs_stat = x_bar, direction = "both")
```

```
# A tibble: 1 × 1
  p_value
    <dbl>
1       0
```

---

# Summary

Here are the steps for *any* hypothesis test:

1. Calculate a **sample statistic** (or *observed effect*), `\(\delta^{*}\)`. 
    - Using `summarize()` or `skim()` with the original sample
    
2. Simulate a world where the *null hypothesis is true*. 
    - Using `generate(reps = )` and `calculate()`
    
3. `visualize()` where `\(\delta^{*}\)` appears in the simulated world where the *null hypothesis is true*.

4. Calculate the *probability* that `\(\delta^{*}\)` (or something *more extreme*) could appear in this simulated world.

- Using `get_p_value(obs_stat = , direction = )`

5. Decide if `\(\delta^{*}\)` is **statistically significant**. 
    - Compare the p-value to 0.05, the **level of significance**. 
    
--

Also: **Don't forget to include a confidence interval!**

---

# Conclusion