class: center, middle, inverse, title-slide # CEMA 0907: Statistics in the Real World ## Multiple Regression ### Anthony Scotina --- # Needed Packages ```r library(tidyverse) library(moderndive) # Remove outlier house_prices = house_prices %>% filter(bedrooms < 33) ``` --- # Multiple Regression **Does bedrooms per house give us the full story?** <img src="06-Multiple_Regression_files/figure-html/unnamed-chunk-4-1.png" width="50%" /> --- # Multiple Regression **Does bedrooms per house give us the full story?** - *No.* <img src="06-Multiple_Regression_files/figure-html/unnamed-chunk-5-1.png" width="50%" /> --- # Multiple Regression **So far**, we have looked at the following *linear regression models* using the `house_prices` data: - `\(\widehat{price} = 110316 + 127548*(bedrooms)\)` - For every additional `bedroom`, there is an associated increase of, **on average**, <span>$</span>127,548 on the price of the home. -- - `\(\widehat{price} = 531559 + 1130317*(waterfrontTRUE)\)` - Homes with a *waterfront view* will cost, **on average**, <span>$</span>1,130,317 more than *non-waterfront homes*. -- **Why not all at once?** - `\(\widehat{price} = b_{0}+b_{1}*(bedrooms)+b_{2}*(waterfrontTRUE)\)` --- # Why multiple regression? It is rare to have a single explanatory (*x*) variable in a study. - A *more complex model* can frequently help with **prediction**. - The **multiple regression model** is of the form `$$\hat{y}=b_{0}+b_{1}x_{1}+b_{2}x_{2}+\cdots+b_{k}x_{k}$$` where there are `\(k\)` explanatory variables. --- # One Categorical AND One Explanatory Variable `$$\widehat{price} = b_{0}+b_{1}*(bedrooms)+b_{2}*(waterfrontTRUE)$$` **Question**: Does the *association* between `price` and `bedrooms` change when *simultaneously* considering homes with a *waterfront view*? -- In this **multiple regression** model, we have: - A **numerical** response variable (*y*), the price of a home - *Two* **explanatory variables**: - A *numerical explanatory variable*, `\(x_{1}\)`, the number of bedrooms per home - A *categorical explanatory variable*, `\(x_{2}\)`, whether or not the home has a waterfront view --- # Data Visualization .pull-left[ ![](06-Multiple_Regression_files/figure-html/unnamed-chunk-6-1.png)<!-- --> ] .pull-right[ **Some observations** - Both slopes are **positive**. - But, the slope for *waterfront view homes* is **more positive**. In other words, the average house price per bedroom is *higher* for *waterfront view homes*. ] --- # Interaction Model We will quantify the relationship between the response *y* and the *two* explanatory variables using a type of multiple regression model known as an **interaction model**. ```r lm_price_interaction = lm(price ~ bedrooms + waterfront + bedrooms*waterfront, data = house_prices) get_regression_table(lm_price_interaction) ``` ``` ## # A tibble: 4 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 114143. 8725. 13.1 0 97042. 131245. ## 2 bedrooms 123862. 2500. 49.5 0 118962. 128763. ## 3 waterfrontTRUE -236296. 84425. -2.80 0.005 -401775. -70817. ## 4 bedrooms:waterfrontTRUE 416652. 24320. 17.1 0 368982. 464322. ``` In this **interaction model**, the model formula was not only of the form `y ~ x1 + x2`, but it included an **interaction term**, `x1*x2`. --- # Interaction Model <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std_error </th> <th style="text-align:right;"> lower_ci </th> <th style="text-align:right;"> upper_ci </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> intercept </td> <td style="text-align:right;"> 114143.5 </td> <td style="text-align:right;"> 8724.783 </td> <td style="text-align:right;"> 97042.24 </td> <td style="text-align:right;"> 131244.68 </td> </tr> <tr> <td style="text-align:left;"> bedrooms </td> <td style="text-align:right;"> 123862.3 </td> <td style="text-align:right;"> 2500.081 </td> <td style="text-align:right;"> 118961.92 </td> <td style="text-align:right;"> 128762.61 </td> </tr> <tr> <td style="text-align:left;"> waterfrontTRUE </td> <td style="text-align:right;"> -236296.0 </td> <td style="text-align:right;"> 84424.858 </td> <td style="text-align:right;"> -401774.98 </td> <td style="text-align:right;"> -70817.08 </td> </tr> <tr> <td style="text-align:left;"> bedrooms:waterfrontTRUE </td> <td style="text-align:right;"> 416652.0 </td> <td style="text-align:right;"> 24320.288 </td> <td style="text-align:right;"> 368982.41 </td> <td style="text-align:right;"> 464321.53 </td> </tr> </tbody> </table> First, recall that *non-waterfront homes* (`waterfront = FALSE`) serves as the **reference** (or *baseline*) group. - Therefore, the `intercept` term is the **intercept** *for only the non-waterfront homes*. - Similarly, the `bedrooms` term is the **slope** *for only the non-waterfront homes*. What about the slope and intercept for *waterfront homes*? --- # Linear Model Equation <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std_error </th> <th style="text-align:right;"> lower_ci </th> <th style="text-align:right;"> upper_ci </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> intercept </td> <td style="text-align:right;"> 114143.5 </td> <td style="text-align:right;"> 8724.783 </td> <td style="text-align:right;"> 97042.24 </td> <td style="text-align:right;"> 131244.68 </td> </tr> <tr> <td style="text-align:left;"> bedrooms </td> <td style="text-align:right;"> 123862.3 </td> <td style="text-align:right;"> 2500.081 </td> <td style="text-align:right;"> 118961.92 </td> <td style="text-align:right;"> 128762.61 </td> </tr> <tr> <td style="text-align:left;"> waterfrontTRUE </td> <td style="text-align:right;"> -236296.0 </td> <td style="text-align:right;"> 84424.858 </td> <td style="text-align:right;"> -401774.98 </td> <td style="text-align:right;"> -70817.08 </td> </tr> <tr> <td style="text-align:left;"> bedrooms:waterfrontTRUE </td> <td style="text-align:right;"> 416652.0 </td> <td style="text-align:right;"> 24320.288 </td> <td style="text-align:right;"> 368982.41 </td> <td style="text-align:right;"> 464321.53 </td> </tr> </tbody> </table> We can write the equation of the linear model *with interaction term* as `\begin{align*} \widehat{price}&=114143+123862(bedrooms)-236296(waterfrontTRUE)\\ &\ \ \ \ +416652(bedrooms:waterfrontTRUE) \end{align*}` --- # Interaction Model `\begin{align*} \widehat{price}&=114143+123862(bedrooms)-236296(waterfrontTRUE)\\ &\ \ \ \ +416652(bedrooms:waterfrontTRUE) \end{align*}` The value for `waterfrontTRUE` of `\(-236296\)` is *not* the intercept for waterfront homes. - Rather, it serves as the **offset** in the intercept for waterfront homes *relative to non-waterfront homes*. - The *intercept* for *waterfront homes* is `intercept` + `waterfrontTRUE` = `\(114143-236296\)` = `\(-122153\)`. -- **Uhhh wut?** - I thought waterfront homes were *more expensive*!!! --- # The Interaction Term `\begin{align*} \widehat{price}&=114143+123862(bedrooms)-236296(waterfrontTRUE)\\ &\ \ \ \ +416652(bedrooms:waterfrontTRUE) \end{align*}` The `bedrooms:waterfrontTRUE` is *not* the slope for age for waterfront homes. - Rather, it serves as the **offset** in the slope for waterfront homes. The *slope* for *waterfront* homes is `bedrooms` + `bedrooms:waterfrontTRUE` = `\(123862+416652\)` = `\(540514\)`. --- # Putting It All Together `\begin{align*} \widehat{price}&=114143+123862(bedrooms)-236296(waterfrontTRUE)\\ &\ \ \ \ +416652(bedrooms:waterfrontTRUE) \end{align*}` Because there are *two levels* of the categorical explanatory variable (`TRUE` and `FALSE`), we can write the regression model as equations for *two separate regression lines*: - For `waterfront = TRUE`, `$$\widehat{price}=-122153+540514(bedrooms)$$` - For `waterfront = FALSE`, `$$\widehat{price}=114143+123862(bedrooms)$$` --- # Interpreting the Regression Lines For `waterfront = TRUE`, `$$\widehat{price}=-122153+540514(bedrooms)$$` - For **homes with a waterfront view**: For every additional `bedroom`, there is an associated increase of, **on average**, <span>$</span>540,514 on the price of the home. -- For `waterfront = FALSE`, `$$\widehat{price}=114143+123862(bedrooms)$$` - For **homes without a waterfront view**: For every additional `bedroom`, there is an associated increase of, **on average**, <span>$</span>123,862 on the price of the home. -- Thus, our model is suggesting that the number of bedrooms impacts the price of *waterfront homes* **more** than it does for *non-waterfront homes*. --- # Summary of Interaction Model `\begin{align*} \widehat{price}&=114143+123862(bedrooms)-236296(waterfrontTRUE)\\ &\ \ \ \ +416652(bedrooms:waterfrontTRUE) \end{align*}` - `\(b_{0}=114143\)` is the `intercept` for *non-waterfront homes*. - `\(b_{1}=123862\)` is the slope for `bedrooms` for *non-waterfront homes*. - `\(b_{2}=-236296\)` is the **offset** in the intercept for *waterfront homes*. - `\(b_{3}=416652\)` is the **offset** in the slope for *waterfront homes*. --- class: center, middle # Parallel Slopes Model --- # Parallel Slopes Model When creating **multiple regression models** with *one numerical* and *one categorical* explanatory variable, we are *not* limited to models with an **interaction effect**. - Another type of model we can use is known as the **parallel slopes model**. -- While **interaction models** can have regression lines with *different slopes and intercepts*, parallel slopes models *force* all lines to have the **same slope**. --- # Visualizing Parallel Slopes The `geom_parallel_slopes()` function in the `moderndive` package provides one way to plot a **parallel slopes model**. ```r ggplot(house_prices, aes(x = bedrooms, y = price, color = waterfront)) + geom_point() + geom_parallel_slopes() ``` <img src="06-Multiple_Regression_files/figure-html/unnamed-chunk-10-1.png" width="35%" /> - **Note**: The *parallel regression lines* are not necessarily the **lines of best fit**! --- # Fitting the Parallel Slopes Model ```r lm_price_parallel = lm(price ~ bedrooms + waterfront, data = house_prices) get_regression_table(lm_price_parallel) ``` ``` ## # A tibble: 3 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 99306. 8740. 11.4 0 82174. 116437. ## 2 bedrooms 128265. 2504. 51.2 0 123358. 133172. ## 3 waterfrontTRUE 1139217. 26274. 43.4 0 1087717. 1190716. ``` The equation for the linear model is: `$$\widehat{price}=99306+128265(bedrooms)+1139217(waterfrontTRUE)$$` --- # Interpreting the Coefficients `$$\widehat{price}=99306+128265(bedrooms)+1139217(waterfrontTRUE)$$` The coefficients `\(b_{0}=99306\)` and `\(b_{2}=1139217\)` acts as they did in the *interaction model*: - `\(b_{0}=99306\)` is the `intercept` for *non-waterfront homes*. - The **average** price is <span>$</span>99,306 for *non-waterfront homes* with 0 *bedrooms*. - `\(b_{2}=1139217\)` is the **offset** in the intercept for *waterfront homes*. - The **average** price is <span>$</span>99,306+<span>$</span>1,139,217 = <span>$</span>1,238,523 for *waterfront homes* with 0 *bedrooms*. -- Unlike in the *interaction model*, there is only one slope term in the *parallel slopes model*: - `\(b_{1}=128265\)` is the slope for *waterfront* **and** *non-waterfront homes*. - **Holding** `waterfront` **fixed** (i.e., for *one level of* `waterfront`): For every additional `bedroom`, there is an associated increase of, **on average**, <span>$</span>128,265 on the price of the home. --- # Parallel Slopes Model `$$\widehat{price}=99306+128265(bedrooms)+1139217(waterfrontTRUE)$$` For *waterfront homes*: `$$\widehat{price}=1238523+\underline{128265}(bedrooms)$$` For *non-waterfront homes*: `$$\widehat{price}=99306+\underline{128265}(bedrooms)$$` --- # Comparing the Models <img src="06-Multiple_Regression_files/figure-html/unnamed-chunk-12-1.png" width="50%" /><img src="06-Multiple_Regression_files/figure-html/unnamed-chunk-12-2.png" width="50%" /> So... why would we *ever* use a **parallel slopes model**? - The lines in the left-hand plot don't look parallel, so why force them to be? - We'll get back to this, but as a short answer: Sometimes **simple** is better! --- class: center, middle # Two (or more) Numerical Explanatory Variables --- # Two Numerical Explanatory Variables Instead of examining a model with *one numerical* and *one categorical* explanatory, let's look at a model with **two numerical explanatory variables**. Using the `house_prices` data: - *y* = `price` - `\(x_{1}\)` = `bedrooms` - `\(x_{2}\)` = `sqft_living` (square footage of the home, from `?house_prices`) In other words, we will fit the model: `$$\widehat{price} = b_{0}+b_{1}(bedrooms)+b_{2}(sqft.living)$$` --- # Fitting the Model To fit a model of this form in R, we use `lm()` exactly as we did in previous examples: ```r lm_multiple = lm(price ~ bedrooms + sqft_living, data = house_prices) get_regression_table(lm_multiple) ``` ``` ## # A tibble: 3 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 90034. 6733. 13.4 0 76836. 103231. ## 2 bedrooms -62062. 2392. -25.9 0 -66751. -57372. ## 3 sqft_living 317. 2.37 134. 0 312. 322. ``` The equation of the *regression plane* is: `$$\widehat{price}=90034-62062(bedrooms)+317(sqft.living)$$` --- # Interpreting the Regression Coefficients `$$\widehat{price}=90034-62062(bedrooms)+317(sqft.living)$$` - `\(b_{0}=90034\)`: The average price is <span>$</span>90,034 for homes with *0 bedrooms* **AND** *0 square footage of space*. - This doesn't make sense in context; using `sqft_living=0` is **extrapolation**! -- - `\(b_{1}=-62062\)`: *Taking all other variables in our model into account* (i.e., holding them fixed), for every additional bedroom, there is an associated **decrease** of <span>$</span>62,062 in price per home, on average. -- - `\(b_{2}=317\)`: *Taking all other variables in our model into account*, for every additional square foot of living space, there is an associated **increase** of <span>$</span>317 in price per home, on average. --- # Interpreting the Regression Coefficients To better understand what these interpretations mean, let's consider a **simple regression model** and a **multiple regression model** side-by-side: ```r lm_simple = lm(price ~ bedrooms, data = house_prices) get_regression_table(lm_simple) ``` ``` ## # A tibble: 2 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 110316. 9108. 12.1 0 92462. 128169. ## 2 bedrooms 127548. 2610. 48.9 0 122432. 132664. ``` ```r lm_multiple = lm(price ~ bedrooms + sqft_living, data = house_prices) get_regression_table(lm_multiple) ``` ``` ## # A tibble: 3 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 90034. 6733. 13.4 0 76836. 103231. ## 2 bedrooms -62062. 2392. -25.9 0 -66751. -57372. ## 3 sqft_living 317. 2.37 134. 0 312. 322. ``` --- class: center, middle # Model Selection --- # Model Selection Sometimes, **simpler** solutions are more likely to be *correct* than **complex** solutions. Using `price` as the response: - The **interaction model** was `\begin{align*} \widehat{price}&=114143+123862(bedrooms)-236296(waterfrontTRUE)\\ &\ \ \ \ +416652(bedrooms:waterfrontTRUE) \end{align*}` - The **parallel slopes model** was `$$\widehat{price}=99306+128265(bedrooms)+1139217(waterfrontTRUE)$$` -- The interaction model is more *complex* due to the extra ( `\(b_{3}=416652\)`) term. - Is the *extra complexity* warranted? - (Arguably, yes. But let's look at an example where it is less obvious...) --- # `MA_schools` The `MA_schools` dataset in the `moderndive` package contains data on MA public high schools in 2017. ```r View(MA_schools) ?MA_schools ``` We will model `average_sat_math` (*y*) as a function of: - `perc_disadvan` ( `\(x_{1}\)`): percent of study body considered "economically disadvantaged" - `size` ( `\(x_{2}\)`): size of enrollment (`small`, `medium`, `large`) --- # Comparing the Models At least visually, the models don't appear very different! - Now let's compare the **regression tables**. -- ```r lm_mass_int = lm(average_sat_math ~ perc_disadvan + size + perc_disadvan*size, data = MA_schools) get_regression_table(lm_mass_int) ``` ``` ## # A tibble: 6 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 594. 13.3 44.7 0 568. 620. ## 2 perc_disadvan -2.93 0.294 -9.96 0 -3.51 -2.35 ## 3 size: medium -17.8 15.8 -1.12 0.263 -48.9 13.4 ## 4 size: large -13.3 13.8 -0.962 0.337 -40.5 13.9 ## 5 perc_disadvan:sizemedi… 0.146 0.371 0.393 0.694 -0.585 0.877 ## 6 perc_disadvan:sizelarge 0.189 0.323 0.586 0.559 -0.446 0.824 ``` --- # Interaction Model ``` ## # A tibble: 6 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 594. 13.3 44.7 0 568. 620. ## 2 perc_disadvan -2.93 0.294 -9.96 0 -3.51 -2.35 ## 3 size: medium -17.8 15.8 -1.12 0.263 -48.9 13.4 ## 4 size: large -13.3 13.8 -0.962 0.337 -40.5 13.9 ## 5 perc_disadvan:sizemedi… 0.146 0.371 0.393 0.694 -0.585 0.877 ## 6 perc_disadvan:sizelarge 0.189 0.323 0.586 0.559 -0.446 0.824 ``` The interaction model is `\begin{align*} \widehat{avg.sat.math}&=594-2.93(perc.disadvan)-17.8(size:medium)\\ & \ \ \ \ -13.3(size:large)+0.146(perc.disadvan*size:medium)\\ & \ \ \ \ +0.189(perc.disadvan*size:large) \end{align*}` --- # Paraellel Slopes Model ```r lm_mass_par = lm(average_sat_math ~ perc_disadvan + size, data = MA_schools) get_regression_table(lm_mass_par) ``` ``` ## # A tibble: 4 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 588. 7.61 77.3 0 573. 603. ## 2 perc_disadvan -2.78 0.106 -26.1 0 -2.99 -2.57 ## 3 size: medium -11.9 7.54 -1.58 0.115 -26.7 2.91 ## 4 size: large -6.36 6.92 -0.919 0.359 -20.0 7.26 ``` The parallel slopes model is `\begin{align*} \widehat{avg.sat.math}&=594-2.78(perc.disadvan)\\ &\ \ \ \ -11.9(size:medium)-6.36(size:large) \end{align*}` --- # Comparing the Models Among other things, the **offsets** for the different slopes in the *interaction model* are very small relative to baseline. - `\(b_{3}=0.146\)` means that the slope for *medium* schools is only 0.146 points higher than baseline (-2.93). - `\(b_{4}=0.189\)` means that the slope for *large* schools is only 0.189 points higher than baseline (-2.93). -- The *p*-values for those estimated coefficients are also small. - This is *beyond the scope of this class*, but the large *p*-values in the regression output mean something! -- We just did a *very simple* form of **model selection**!