CEMA 0907: Statistics in the Real World

# CEMA 0907: Statistics in the Real World
## Multiple Regression
### Anthony Scotina

---

# Needed Packages

```r
library(tidyverse)
library(moderndive)

# Remove outlier
house_prices = house_prices %>%
  filter(bedrooms < 33)
```

---

# Multiple Regression

**Does bedrooms per house give us the full story?**

---

# Multiple Regression

**Does bedrooms per house give us the full story?**

- *No.*

---

# Multiple Regression

**So far**, we have looked at the following *linear regression models* using the `house_prices` data:

- `$\widehat{price} = 110316 + 127548*(bedrooms)$`
    - For every additional `bedroom`, there is an associated increase of, **on average**, <span>&#36;</span>127,548 on the price of the home.

- `$\widehat{price} = 531559 + 1130317*(waterfrontTRUE)$`
    - Homes with a *waterfront view* will cost, **on average**, <span>&#36;</span>1,130,317 more than *non-waterfront homes*.

**Why not all at once?**

- `$\widehat{price} = b_{0}+b_{1}*(bedrooms)+b_{2}*(waterfrontTRUE)$`

---

# Why multiple regression?

It is rare to have a single explanatory (*x*) variable in a study.

- A *more complex model* can frequently help with **prediction**.

- The **multiple regression model** is of the form `$$\hat{y}=b_{0}+b_{1}x_{1}+b_{2}x_{2}+\cdots+b_{k}x_{k}$$`
where there are `$k$` explanatory variables.

---

# One Categorical AND One Explanatory Variable

`$$\widehat{price} = b_{0}+b_{1}*(bedrooms)+b_{2}*(waterfrontTRUE)$$`

**Question**: Does the *association* between `price` and `bedrooms` change when *simultaneously* considering homes with a *waterfront view*?

In this **multiple regression** model, we have:

- A **numerical** response variable (*y*), the price of a home

- *Two* **explanatory variables**:
    - A *numerical explanatory variable*, `$x_{1}$`, the number of bedrooms per home
    - A *categorical explanatory variable*, `$x_{2}$`, whether or not the home has a waterfront view
    
---

# Data Visualization

- Both slopes are **positive**.

- But, the slope for *waterfront view homes* is **more positive**.

In other words, the average house price per bedroom is *higher* for *waterfront view homes*. 
]

---

# Interaction Model

We will quantify the relationship between the response *y* and the *two* explanatory variables using a type of multiple regression model known as an **interaction model**.

```r
lm_price_interaction = lm(price ~ bedrooms + waterfront + 
                            bedrooms*waterfront, 
                          data = house_prices)
get_regression_table(lm_price_interaction)
```

```
## # A tibble: 4 × 7
##   term                    estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                114143.     8725.     13.1    0       97042.  131245.
## 2 bedrooms                 123862.     2500.     49.5    0      118962.  128763.
## 3 waterfrontTRUE          -236296.    84425.     -2.80   0.005 -401775.  -70817.
## 4 bedrooms:waterfrontTRUE  416652.    24320.     17.1    0      368982.  464322.
```

In this **interaction model**, the model formula was not only of the form `y ~ x1 + x2`, but it included an **interaction term**, `x1*x2`.

---

# Interaction Model

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std_error </th>
   <th style="text-align:right;"> lower_ci </th>
   <th style="text-align:right;"> upper_ci </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> intercept </td>
   <td style="text-align:right;"> 114143.5 </td>
   <td style="text-align:right;"> 8724.783 </td>
   <td style="text-align:right;"> 97042.24 </td>
   <td style="text-align:right;"> 131244.68 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> bedrooms </td>
   <td style="text-align:right;"> 123862.3 </td>
   <td style="text-align:right;"> 2500.081 </td>
   <td style="text-align:right;"> 118961.92 </td>
   <td style="text-align:right;"> 128762.61 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> waterfrontTRUE </td>
   <td style="text-align:right;"> -236296.0 </td>
   <td style="text-align:right;"> 84424.858 </td>
   <td style="text-align:right;"> -401774.98 </td>
   <td style="text-align:right;"> -70817.08 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> bedrooms:waterfrontTRUE </td>
   <td style="text-align:right;"> 416652.0 </td>
   <td style="text-align:right;"> 24320.288 </td>
   <td style="text-align:right;"> 368982.41 </td>
   <td style="text-align:right;"> 464321.53 </td>
  </tr>
</tbody>
</table>

First, recall that *non-waterfront homes* (`waterfront = FALSE`) serves as the **reference** (or *baseline*) group. 
- Therefore, the `intercept` term is the **intercept** *for only the non-waterfront homes*.

- Similarly, the `bedrooms` term is the **slope** *for only the non-waterfront homes*.

What about the slope and intercept for *waterfront homes*?

---

# Linear Model Equation

We can write the equation of the linear model *with interaction term* as 
`\begin{align*}
\widehat{price}&=114143+123862(bedrooms)-236296(waterfrontTRUE)\\
&\ \ \ \ +416652(bedrooms:waterfrontTRUE)
\end{align*}`

---

# Interaction Model

`\begin{align*}
\widehat{price}&=114143+123862(bedrooms)-236296(waterfrontTRUE)\\
&\ \ \ \ +416652(bedrooms:waterfrontTRUE)
\end{align*}`

The value for `waterfrontTRUE` of `$-236296$` is *not* the intercept for waterfront homes.
- Rather, it serves as the **offset** in the intercept for waterfront homes *relative to non-waterfront homes*.

- The *intercept* for *waterfront homes* is `intercept` + `waterfrontTRUE` = `$114143-236296$` = `$-122153$`.

**Uhhh wut?**

- I thought waterfront homes were *more expensive*!!!

---

# The Interaction Term

`\begin{align*}
\widehat{price}&=114143+123862(bedrooms)-236296(waterfrontTRUE)\\
&\ \ \ \ +416652(bedrooms:waterfrontTRUE)
\end{align*}`

The `bedrooms:waterfrontTRUE` is *not* the slope for age for waterfront homes. 
- Rather, it serves as the **offset** in the slope for waterfront homes.

The *slope* for *waterfront* homes is `bedrooms` + `bedrooms:waterfrontTRUE` = `$123862+416652$` = `$540514$`.

---

# Putting It All Together

`\begin{align*}
\widehat{price}&=114143+123862(bedrooms)-236296(waterfrontTRUE)\\
&\ \ \ \ +416652(bedrooms:waterfrontTRUE)
\end{align*}`

Because there are *two levels* of the categorical explanatory variable (`TRUE` and `FALSE`), we can write the regression model as equations for *two separate regression lines*:

- For `waterfront = TRUE`, `$$\widehat{price}=-122153+540514(bedrooms)$$`
- For `waterfront = FALSE`, `$$\widehat{price}=114143+123862(bedrooms)$$`

---

# Interpreting the Regression Lines

For `waterfront = TRUE`, `$$\widehat{price}=-122153+540514(bedrooms)$$`

- For **homes with a waterfront view**: For every additional `bedroom`, there is an associated increase of, **on average**, <span>&#36;</span>540,514 on the price of the home.

For `waterfront = FALSE`, `$$\widehat{price}=114143+123862(bedrooms)$$`

- For **homes without a waterfront view**: For every additional `bedroom`, there is an associated increase of, **on average**, <span>&#36;</span>123,862 on the price of the home.

Thus, our model is suggesting that the number of bedrooms impacts the price of *waterfront homes* **more** than it does for *non-waterfront homes*.

---

# Summary of Interaction Model

`\begin{align*}
\widehat{price}&=114143+123862(bedrooms)-236296(waterfrontTRUE)\\
&\ \ \ \ +416652(bedrooms:waterfrontTRUE)
\end{align*}`

- `$b_{0}=114143$` is the `intercept` for *non-waterfront homes*.

- `$b_{1}=123862$` is the slope for `bedrooms` for *non-waterfront homes*.

- `$b_{2}=-236296$` is the **offset** in the intercept for *waterfront homes*.

- `$b_{3}=416652$` is the **offset** in the slope for *waterfront homes*.

---

# Parallel Slopes Model

---

# Parallel Slopes Model

When creating **multiple regression models** with *one numerical* and *one categorical* explanatory variable, we are *not* limited to models with an **interaction effect**.

- Another type of model we can use is known as the **parallel slopes model**.

While **interaction models** can have regression lines with *different slopes and intercepts*, parallel slopes models *force* all lines to have the **same slope**.

---

# Visualizing Parallel Slopes

The `geom_parallel_slopes()` function in the `moderndive` package provides one way to plot a **parallel slopes model**.

```r
ggplot(house_prices, aes(x = bedrooms, y = price, color = waterfront)) + 
  geom_point() + 
  geom_parallel_slopes()
```

- **Note**: The *parallel regression lines* are not necessarily the **lines of best fit**!

---

# Fitting the Parallel Slopes Model

```r
lm_price_parallel = lm(price ~ bedrooms + waterfront, 
                       data = house_prices)
get_regression_table(lm_price_parallel)
```

```
## # A tibble: 3 × 7
##   term           estimate std_error statistic p_value lower_ci upper_ci
##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept        99306.     8740.      11.4       0   82174.  116437.
## 2 bedrooms        128265.     2504.      51.2       0  123358.  133172.
## 3 waterfrontTRUE 1139217.    26274.      43.4       0 1087717. 1190716.
```

The equation for the linear model is: `$$\widehat{price}=99306+128265(bedrooms)+1139217(waterfrontTRUE)$$`

---

# Interpreting the Coefficients

`$$\widehat{price}=99306+128265(bedrooms)+1139217(waterfrontTRUE)$$`

The coefficients `$b_{0}=99306$` and `$b_{2}=1139217$` acts as they did in the *interaction model*:

- `$b_{0}=99306$` is the `intercept` for *non-waterfront homes*.
    - The **average** price is <span>&#36;</span>99,306 for *non-waterfront homes* with 0 *bedrooms*.

- `$b_{2}=1139217$` is the **offset** in the intercept for *waterfront homes*. 
    - The **average** price is <span>&#36;</span>99,306+<span>&#36;</span>1,139,217 = <span>&#36;</span>1,238,523 for *waterfront homes* with 0 *bedrooms*. 
    
--

Unlike in the *interaction model*, there is only one slope term in the *parallel slopes model*:

- `$b_{1}=128265$` is the slope for *waterfront* **and** *non-waterfront homes*. 
    - **Holding** `waterfront` **fixed** (i.e., for *one level of* `waterfront`): For every additional `bedroom`, there is an associated increase of, **on average**, <span>&#36;</span>128,265 on the price of the home.

---

# Parallel Slopes Model

`$$\widehat{price}=99306+128265(bedrooms)+1139217(waterfrontTRUE)$$`

For *waterfront homes*: `$$\widehat{price}=1238523+\underline{128265}(bedrooms)$$`

For *non-waterfront homes*: `$$\widehat{price}=99306+\underline{128265}(bedrooms)$$`

---

# Comparing the Models

So... why would we *ever* use a **parallel slopes model**?
- The lines in the left-hand plot don't look parallel, so why force them to be?
- We'll get back to this, but as a short answer: Sometimes **simple** is better!

---

# Two (or more) Numerical Explanatory Variables

---

# Two Numerical Explanatory Variables

Instead of examining a model with *one numerical* and *one categorical* explanatory, let's look at a model with **two numerical explanatory variables**.

Using the `house_prices` data:

- *y* = `price`

- `$x_{1}$` = `bedrooms`

- `$x_{2}$` = `sqft_living` (square footage of the home, from `?house_prices`)

In other words, we will fit the model: `$$\widehat{price} = b_{0}+b_{1}(bedrooms)+b_{2}(sqft.living)$$`

---

# Fitting the Model

To fit a model of this form in R, we use `lm()` exactly as we did in previous examples:

```r
lm_multiple = lm(price ~ bedrooms + sqft_living, data = house_prices)
get_regression_table(lm_multiple)
```

```
## # A tibble: 3 × 7
##   term        estimate std_error statistic p_value lower_ci upper_ci
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept     90034.   6733.        13.4       0   76836.  103231.
## 2 bedrooms     -62062.   2392.       -25.9       0  -66751.  -57372.
## 3 sqft_living     317.      2.37     134.        0     312.     322.
```

The equation of the *regression plane* is: `$$\widehat{price}=90034-62062(bedrooms)+317(sqft.living)$$`

---

# Interpreting the Regression Coefficients

`$$\widehat{price}=90034-62062(bedrooms)+317(sqft.living)$$`

- `$b_{0}=90034$`: The average price is <span>&#36;</span>90,034 for homes with *0 bedrooms* **AND** *0 square footage of space*. 
    - This doesn't make sense in context; using `sqft_living=0` is **extrapolation**!
    
--

- `$b_{1}=-62062$`: *Taking all other variables in our model into account* (i.e., holding them fixed), for every additional bedroom, there is an associated **decrease** of <span>&#36;</span>62,062 in price per home, on average.

- `$b_{2}=317$`: *Taking all other variables in our model into account*, for every additional square foot of living space, there is an associated **increase** of <span>&#36;</span>317 in price per home, on average.

---

# Interpreting the Regression Coefficients

To better understand what these interpretations mean, let's consider a **simple regression model** and a **multiple regression model** side-by-side:

```r
lm_simple = lm(price ~ bedrooms, data = house_prices)
get_regression_table(lm_simple)
```

```
## # A tibble: 2 × 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept  110316.     9108.      12.1       0   92462.  128169.
## 2 bedrooms   127548.     2610.      48.9       0  122432.  132664.
```

```r
lm_multiple = lm(price ~ bedrooms + sqft_living, data = house_prices)
get_regression_table(lm_multiple)
```

---

# Model Selection

---

# Model Selection

Sometimes, **simpler** solutions are more likely to be *correct* than **complex** solutions.

Using `price` as the response:

- The **interaction model** was
`\begin{align*}
\widehat{price}&=114143+123862(bedrooms)-236296(waterfrontTRUE)\\
&\ \ \ \ +416652(bedrooms:waterfrontTRUE)
\end{align*}`

- The **parallel slopes model** was `$$\widehat{price}=99306+128265(bedrooms)+1139217(waterfrontTRUE)$$`

The interaction model is more *complex* due to the extra ( `$b_{3}=416652$`) term.

- Is the *extra complexity* warranted?

- (Arguably, yes. But let's look at an example where it is less obvious...)

---

# `MA_schools`

The `MA_schools` dataset in the `moderndive` package contains data on MA public high schools in 2017.

```r
View(MA_schools)
?MA_schools
```

We will model `average_sat_math` (*y*) as a function of:

- `perc_disadvan` ( `$x_{1}$`): percent of study body considered "economically disadvantaged"

- `size` ( `$x_{2}$`): size of enrollment (`small`, `medium`, `large`)

---

# Comparing the Models

At least visually, the models don't appear very different!
- Now let's compare the **regression tables**.

```r
lm_mass_int = lm(average_sat_math ~ perc_disadvan + size + 
                   perc_disadvan*size, 
                 data = MA_schools)
get_regression_table(lm_mass_int)
```

```
## # A tibble: 6 × 7
##   term                    estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                594.       13.3      44.7     0      568.     620.   
## 2 perc_disadvan             -2.93      0.294    -9.96    0       -3.51    -2.35 
## 3 size: medium             -17.8      15.8      -1.12    0.263  -48.9     13.4  
## 4 size: large              -13.3      13.8      -0.962   0.337  -40.5     13.9  
## 5 perc_disadvan:sizemedi…    0.146     0.371     0.393   0.694   -0.585    0.877
## 6 perc_disadvan:sizelarge    0.189     0.323     0.586   0.559   -0.446    0.824
```

---

# Interaction Model

The interaction model is 
`\begin{align*}
  \widehat{avg.sat.math}&=594-2.93(perc.disadvan)-17.8(size:medium)\\
  & \ \ \ \ -13.3(size:large)+0.146(perc.disadvan*size:medium)\\
  & \ \ \ \ +0.189(perc.disadvan*size:large)
\end{align*}`

---

# Paraellel Slopes Model

```r
lm_mass_par = lm(average_sat_math ~ perc_disadvan + size, data = MA_schools)
get_regression_table(lm_mass_par)
```

```
## # A tibble: 4 × 7
##   term          estimate std_error statistic p_value lower_ci upper_ci
##   <chr>            <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept       588.       7.61     77.3     0       573.     603.  
## 2 perc_disadvan    -2.78     0.106   -26.1     0        -2.99    -2.57
## 3 size: medium    -11.9      7.54     -1.58    0.115   -26.7      2.91
## 4 size: large      -6.36     6.92     -0.919   0.359   -20.0      7.26
```

The parallel slopes model is 
`\begin{align*}
  \widehat{avg.sat.math}&=594-2.78(perc.disadvan)\\
  &\ \ \ \ -11.9(size:medium)-6.36(size:large)
\end{align*}`

---

# Comparing the Models

Among other things, the **offsets** for the different slopes in the *interaction model* are very small relative to baseline.

- `$b_{3}=0.146$` means that the slope for *medium* schools is only 0.146 points higher than baseline (-2.93).

- `$b_{4}=0.189$` means that the slope for *large* schools is only 0.189 points higher than baseline (-2.93).

The *p*-values for those estimated coefficients are also small. 
- This is *beyond the scope of this class*, but the large *p*-values in the regression output mean something!

We just did a *very simple* form of **model selection**!