# CEMA 0907: Statistics in the Real World
## Basic Regression
### Anthony Scotina

---

# Needed Packages

```r
library(tidyverse)
library(moderndive)
library(palmerpenguins) # Install this!
```

---

# Linear Regression

There are *many* ways to model data. For the rest of this class, we will focus on **linear regression**.

**Linear regression modeling** involves:

- a **numerical** outcome variable `$y$`, and

- explanatory variable(s) `$x$` that are either *numerical* or *categorical*

The **model** follows this form: `$$\hat{y}=b_{0}+b_{1}\cdot x$$`
where:

- `$\hat{y}$` is the **predicted** value of *y*

- `$b_{0}$` and `$b_{1}$` are **coefficients** (more on these later)

---

# One Numerical Explanatory Variable

---

# Motivating Example

What factors explain the differences in house prices in Washington state?

The `house_prices` dataset in the `moderndive` package contains data on a sample of 21,613 homes sold in King County, Washington between May 2014 and May 2015.

```r
View(house_prices)
```

**Outcome** variable (*y*): `price` (price of the house when sold, in USD)

**Explanatory** variable (*x*): `bedrooms` (number of bedrooms)

**"Research" question**: Could it be that more expensive homes have more bedrooms?!

---

# Summary Statistics

```r
house_prices %>%
  select(price, bedrooms) %>%
  summarize(mean_price = mean(price), sd_price = sd(price), 
            mean_bed = mean(bedrooms), sd_bed = sd(bedrooms))
```

```
## # A tibble: 1 × 4
##   mean_price sd_price mean_bed sd_bed
##        <dbl>    <dbl>    <dbl>  <dbl>
## 1    540088.  367127.     3.37  0.930
```

---

# Summary Statistics

The summary statistics give us a snapshot of the *univariate* distribution of each variable:

- The **mean** house price is &#36;540,088.14 with a *standard deviation* of &#36;367,127.20. 
 - As you might imagine, this is a very **right-skewed** variable (median price is &#36;450,000). 
 
- The **mean** number of bedrooms per home is 3.37 with a *standard deviation* of 0.93. 
 - This is actually a (roughly) symmetrical variable, save for the house with **THIRTY THREE** (33) bedrooms!!!
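
To see why a right-skewed variable has a mean above its median, here is a minimal sketch with made-up prices (in thousands of dollars). These numbers are an assumption for illustration only and are *not* taken from `house_prices`.

```r
# Made-up prices in $1000s: six typical homes and one very expensive home
prices = c(300, 350, 400, 450, 500, 550, 3000)

mean(prices)    # pulled upward by the single expensive home
median(prices)  # the middle value, unaffected by how extreme the largest price is
```

The one extreme value drags the mean well above the median, which is the same pattern the house prices show.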
 
--

Note that these are all **univariate** summaries, i.e., summaries about *single variables*.

Let's review a statistic that quantifies the relationship *between* two variables.

---

# Correlation Coefficient

The **correlation coefficient** (*r*) is a *bivariate* summary statistic.

- summarizes the strength of the **linear** relationship between two *numerical* variables.

- ranges from -1 to 1

- -1 indicates a **perfect negative** *linear* relationship: as the value of one variable goes up, the value of the other variable tends to go down.

- 0 indicates **no linear relationship**: as the value of one variable goes up, the value of the other shows no consistent *linear* tendency to go up or down (a *non-linear* relationship is still possible).

- +1 indicates a **perfect positive** *linear* relationship: as the value of one variable goes up, the value of the other variable tends to go up as well.
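
As a rough illustration of what *r* measures, the sketch below computes it "by hand" from standardized values and compares the result with base R's `cor()`. The vectors `x` and `y` are made-up toy data, not part of `house_prices`.

```r
# Made-up toy data (an assumption for illustration)
x = c(1, 2, 3, 4, 5)
y = c(2, 4, 5, 4, 5)

# Standardize each variable: subtract the mean, divide by the standard deviation
z_x = (x - mean(x)) / sd(x)
z_y = (y - mean(y)) / sd(y)

# r is the sum of the products of the z-scores, divided by n - 1
r_by_hand = sum(z_x * z_y) / (length(x) - 1)

r_by_hand
cor(x, y)  # base R's built-in correlation; matches r_by_hand
```

Because both variables are standardized first, the result always lands between -1 and 1.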

---

# Correlation Coefficient in R

We can use the `get_correlation()` function from the `moderndive` package:

```r
house_prices %>%
  get_correlation(formula = price ~ bedrooms)
```

```
## # A tibble: 1 × 1
##     cor
##   <dbl>
## 1 0.308
```

- `$r=0.31$`: There is a *weak-to-moderate*, *positive* **linear** relationship between house price and bedrooms per home.

**Reminder**: All the correlation coefficient shows us is the *strength* and *direction* of the *linear* relationship. **That's it**. 
- The 0.31 is *not* on the same scale as *x* or *y*.
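
One way to convince yourself that *r* has no units: changing the units of a variable leaves *r* unchanged. The sketch below uses made-up bedroom counts and prices (an assumption for illustration, not rows from `house_prices`).

```r
# Made-up toy data
bedrooms  = c(2, 3, 3, 4, 5)
price_usd = c(300000, 420000, 390000, 560000, 610000)

r_usd       = cor(bedrooms, price_usd)         # price in dollars
r_thousands = cor(bedrooms, price_usd / 1000)  # same prices, in $1000s

all.equal(r_usd, r_thousands)  # TRUE: rescaling price does not change r
```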

---

# Data Visualization

Because `price` and `bedrooms` are both **numerical**, a **scatterplot** would be useful in visualizing their relationship.

```r
ggplot(data = house_prices, aes(x = bedrooms, y = price)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  labs(x = "Bedrooms per home", y = "Price (in $)")
```

---

# Taking care of the outlier...

It is reasonable to suspect that the **outlier** with 33 bedrooms is *not representative* of the population in the same way that the rest of the sample is.

- Let's remove the outlier to see if `bedrooms` and `price` are more *linearly related*:

```r
house_prices = house_prices %>%
 filter(bedrooms < 33)
```

- This removes the outlier from the data.

---

# (New) Data Visualization

```r
*ggplot(data = house_prices, aes(x = bedrooms, y = price)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  labs(x = "Bedrooms per home", y = "Price (in $)")
```

- Even after removing the outlier, there isn't a clear linear relationship between `price` and `bedrooms`.

---

# Simple Linear Regression

---

# Linear Model

---

# Non-linear Model

---

# Non-linear Model

---

# A (bad) Model

---

# Models

In statistics, a **model** is a summary and simplification of data that helps our understanding in many ways.

A **linear model** uses sample data to generate a *line of best fit*...
- ...that is used to help our understanding of the linear relationship between `$x$` and `$y$`.

- Our model will be *wrong* (i.e., our line won't match reality *perfectly*).

- But hopefully, it is still useful!

---

# A Good Quote

"All models are wrong, but some are useful."

- George Box, famous statistician

---

# Simple Linear Regression Model

A **simple linear regression model** follows the form of an *equation of a straight line*:
`$$\hat{y}=b_{0}+b_{1}\cdot x$$`

- The `$\hat{y}$` denotes the **predicted outcome variable**. 
    
- The **intercept coefficient** is `$b_{0}$`, or the value of `$\hat{y}$` when `$x=0$`.

- The **slope coefficient** is `$b_{1}$`, or the *average* change in `$\hat{y}$` for every one-unit increase in `$x$`.
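
To make these two definitions concrete, here is a small sketch on made-up toy data (an assumption, not `house_prices`). It fits a line with `lm()` and then checks that the prediction at `x = 0` is the intercept, and that each one-unit step in `x` changes the prediction by the slope.

```r
# Made-up toy data
x = c(1, 2, 3, 4, 5)
y = c(2, 4, 5, 4, 5)

fit = lm(y ~ x)
b0 = coef(fit)[["(Intercept)"]]  # intercept coefficient
b1 = coef(fit)[["x"]]            # slope coefficient

# Predicted values at x = 0, 1, 2
y_hat = predict(fit, newdata = data.frame(x = c(0, 1, 2)))

y_hat[1]     # prediction at x = 0, equal to b0
diff(y_hat)  # change in prediction per one-unit increase in x, equal to b1
```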

---

# Regression Coefficients

.pull-right[
Because `$x=bedrooms$` and `$y=price$`, the regression equation is
`$$\widehat{price} = b_{0}+b_{1}\cdot bedrooms$$`
- Do you think the slope will be *positive* or *negative*?
]

---

# Regression Coefficients

But what are the *specific values* of the regression coefficients, `$b_{0}$` and `$b_{1}$`?

- Luckily, R can calculate these for us, by using the `lm()` function.

```r
lm_house = lm(price ~ bedrooms, data = house_prices)
get_regression_table(lm_house)
```

```
## # A tibble: 2 × 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept  110316.     9108.      12.1       0   92462.  128169.
## 2 bedrooms   127548.     2610.      48.9       0  122432.  132664.
```

---

# The Estimated Linear Model

```r
lm_house = lm(price ~ bedrooms, data = house_prices)
get_regression_table(lm_house)
```

- The **intercept** coefficient is `$b_{0}=110316$`. 
- The **slope** coefficient is `$b_{1}=127548$`.

Therefore, `$$\widehat{price} = 110316 + 127548\cdot bedrooms$$`

---

# Interpreting the Regression Coefficients

The **intercept** `$b_{0}=110316$`.

- This means that the **predicted** price is &#36;110,316 for homes with 0 *bedrooms*. 
 - The intercept often doesn't make sense in context, but it does make sense here (e.g., studio apartments?).

The **slope** `$b_{1}=127548$`.

- This means that, for every additional `bedroom`, there is an associated increase of &#36;127,548 in the **predicted** price of the home. 
 - The slope is usually of more interest to us than the intercept.

---

# Predicting House Price

We can also use the equation of the linear model to **predict** the outcome (*y*) for a given value of *x*.

For example, let's predict the *price* of a home with *three bedrooms*: `$$\widehat{price} = 110316 + 127548\cdot bedrooms= 110316 + 127548\cdot 3=492960$$`

The **linear model** predicts that a house with *three bedrooms* will cost &#36;492,960.
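
The same arithmetic can be sketched in R. The coefficients below are the rounded values from the regression table; `predict_price()` is a hypothetical helper name introduced only for this illustration.

```r
b0 = 110316  # intercept, rounded, from the regression table
b1 = 127548  # slope, rounded, from the regression table

# Hypothetical helper: plug a number of bedrooms into the fitted equation
predict_price = function(bedrooms) b0 + b1 * bedrooms

predict_price(3)        # 492960, matching the calculation above
predict_price(c(2, 4))  # works for several homes at once
```

With the fitted model object available, `predict(lm_house, newdata = data.frame(bedrooms = 3))` should give essentially the same answer, differing slightly because it uses the unrounded coefficients.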

---

# One Categorical Explanatory Variable

---

# Motivating Example

Do you think that **waterfront homes** are typically *more expensive* than **non-waterfront homes**?

---

# Practice

Using `house_prices`, perform *all steps from the regression analysis* of **bedrooms** (*x*) and **price** (*y*), but use `waterfront` as the *x* variable *instead*.

- What do you notice about how `lm()` reports information for a **categorical explanatory variable**?

---

# Summary Statistics

```r
house_prices %>%
  select(price, waterfront) %>%
* group_by(waterfront) %>%
  summarize(mean_price = mean(price), sd_price = sd(price))
```

```
## # A tibble: 2 × 3
##   waterfront mean_price sd_price
##   <lgl>           <dbl>    <dbl>
## 1 FALSE         531559.  341607.
## 2 TRUE         1661876. 1120372.
```

---

# Data Visualization

Because the *x* variable is **categorical**, a *boxplot* might be a useful visualization.

```r
ggplot(data = house_prices, aes(x = waterfront, y = price)) + 
  geom_boxplot() + 
  labs(x = "Waterfront home?", y = "Price (in $)") 
```

---

# Linear Regression Model

```r
lm_waterfront = lm(price ~ waterfront, data = house_prices)
get_regression_table(lm_waterfront)
```

```
## # A tibble: 2 × 7
##   term           estimate std_error statistic p_value lower_ci upper_ci
##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept       531559.     2416.     220.        0  526822.  536295.
## 2 waterfrontTRUE 1130317.    27823.      40.6       0 1075782. 1184853.
```

`$$\widehat{price}=531559+1130317\cdot waterfrontTRUE$$`

Okay... what does `waterfrontTRUE` mean?!

---

# Dummy Variables

`$$\widehat{price}=531559+1130317\cdot waterfrontTRUE$$`

`waterfrontTRUE` is a **dummy variable**: it equals 1 when `waterfront` is `TRUE` and 0 when `waterfront` is `FALSE`.

When using a **categorical explanatory variable** in a regression model, the *estimated coefficient* corresponds to the **difference in means** between:

- one level of the categorical explanatory variable, and
- the *reference level* of the categorical explanatory variable (usually the level that comes first *alphabetically*)

Because the `waterfront` variable takes *two levels* (`TRUE` and `FALSE`), the *reference level* is `FALSE`.

- `$b_{1}=1130317$`: Homes with a *waterfront view* are *predicted* to cost, **on average**, &#36;1,130,317 more than *non-waterfront homes*.

- `$b_{0}=531559$`: Homes *without a waterfront view* (i.e., `waterfrontTRUE = 0`) are *predicted* to cost, **on average**, &#36;531,559.
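
To see the 0/1 coding in action, here is a minimal sketch on a tiny made-up data frame. The name `toy` and its prices are assumptions for illustration, not rows from `house_prices`.

```r
# Made-up toy data: three non-waterfront homes and two waterfront homes
toy = data.frame(
  price      = c(400, 500, 600, 1500, 1700),
  waterfront = c(FALSE, FALSE, FALSE, TRUE, TRUE)
)

fit = lm(price ~ waterfront, data = toy)

# The column R actually uses for waterfront: 0 for FALSE rows, 1 for TRUE rows
model.matrix(fit)

coef(fit)  # intercept = mean price when FALSE; waterfrontTRUE = difference in means

# Group means, for comparison with the coefficients
group_means = tapply(toy$price, toy$waterfront, mean)
group_means
```

In this toy example the `FALSE` group averages 500 and the `TRUE` group averages 1600, so the intercept is 500 and the `waterfrontTRUE` coefficient is 1100.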

---

# Back to Summary Statistics

```r
lm_waterfront = lm(price ~ waterfront, data = house_prices)
get_regression_table(lm_waterfront)
```

```r
house_prices %>%
  group_by(waterfront) %>%
  summarize(mean_price = mean(price))
```

```
## # A tibble: 2 × 2
##   waterfront mean_price
##   <lgl>           <dbl>
## 1 FALSE         531559.
## 2 TRUE         1661876.
```

**We knew the regression coefficients the whole time!!!**
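
A quick arithmetic check makes this explicit. The sketch below uses the rounded numbers printed in the two tables; the variable names are chosen only for this illustration.

```r
# Group means from the summary table (rounded)
mean_false = 531559   # waterfront == FALSE
mean_true  = 1661876  # waterfront == TRUE

# Coefficients from the regression table (rounded)
b0 = 531559
b1 = 1130317

b0 == mean_false              # intercept is the mean of the reference group
b1 == mean_true - mean_false  # slope is the difference in group means
```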