class: center, middle, inverse, title-slide # CEMA 0907: Statistics in the Real World ## Basic Regression ### Anthony Scotina --- # Needed Packages ```r library(tidyverse) library(moderndive) library(palmerpenguins) # Install this! ``` --- # Linear Regression There are *many* ways to model data. For the rest of this class, we will focus on **linear regression**. **Linear regression modeling** involves: - a **numerical** outcome variable `\(y\)`, and - explanatory variable(s) `\(x\)` that are either *numerical* or *categorical* -- The **model** follows this form: `$$\hat{y}=b_{0}+b_{1}\cdot x$$` where: - `\(\hat{y}\)` is the **predicted** value of *y* - `\(b_{0}\)` and `\(b_{1}\)` are **coefficients** (more on these later) --- class: center, middle # One Numerical Explanatory Variable --- # Motivating Example What factors explain the differences in house prices in Washington state? The `house_prices` dataset in the `moderndive` package contains data on a sample of 21,613 homes sold in King County, Washington between May 2014 and May 2015. ```r View(house_prices) ``` -- **Outcome** variable (*y*): `price` (price of the house when sold, in USD) **Explanatory** variable (*x*): `bedrooms` (number of bedrooms) -- **"Research" question**: Could it be that more expensive homes have more bedrooms?! --- # Summary Statistics ```r house_prices %>% select(price, bedrooms) %>% summarize(mean_price = mean(price), sd_price = sd(price), mean_bed = mean(bedrooms), sd_bed = sd(bedrooms)) ``` ``` ## # A tibble: 1 × 4 ## mean_price sd_price mean_bed sd_bed ## <dbl> <dbl> <dbl> <dbl> ## 1 540088. 367127. 3.37 0.930 ``` --- # Summary Statistics The summary statistics give us a snapshot at the *univariate* distribution for each variable: - The **mean** house price is <span>$</span>540,088.14 with a *standard deviation* of <span>$</span>367,127.20. - As you might imagine, this is a very **right-skewed** variable (median price is <span>$</span>450,000). - The **mean** number of bedrooms per home is 3.37 with a *standard deviation of 0.93. - This is actually a (roughly) symmetrical variable, save for the house with **THIRTY THREE** (33) bedrooms!!! -- Note that these are all **univariate** summaries, i.e., summaries about *single variables*. Let's review a statistic that quantifies the relationship *between* two variables. --- # Correlation Coefficient The **correlation coefficient** (*r*) is a *bivariate* summary statistic. - summarizes the strength of the **linear** relationship between two *numerical* variables. - ranges from -1 to 1 -- - -1 indicates a **perfect negative** *linear* relationship: as the value of one variable goes up, the value of the other variable tends to go down. - 0 indicates **no linear relationship**: the values of both variables go up/down independently of each other. - +1 indicates a **perfect positive** *linear* relationship: as the value of one variable goes up, the value of the other variable tends to go up as well. --- # Correlation Coefficient in R We can use the `get_correlation()` function from the `moderndive` package: ```r house_prices %>% get_correlation(formula = price ~ bedrooms) ``` ``` ## # A tibble: 1 × 1 ## cor ## <dbl> ## 1 0.308 ``` - `\(r=0.31\)`: There is a *weak-to-moderate* **linear** relationship between house price and bedrooms per home. **Reminder**: All the correlation coefficient shows us is the *strength* and *direction* of the *linear* relationship. **That's it**. - The 0.31 is *not* on the same scale as *x* or *y*. --- # Data Visualization Because `price` and `bedrooms` are both **numerical**, a **scatterplot** would be useful in visualizing their relationship. ```r ggplot(data = house_prices, aes(x = bedrooms, y = price)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + labs(x = "Bedrooms per home", y = "Price (in $)") ``` <img src="05-Simple_Regression_files/figure-html/unnamed-chunk-6-1.png" width="40%" /> --- # Taking care of the outlier... It is reasonable to suspect that the **outlier** with 33 bedrooms is *not representative* of the population in the same way that the rest of the sample is. - Let's remove the outlier to see if `bedrooms` and `price` are more *linearly related*: ```r house_prices = house_prices %>% filter(bedrooms < 33) ``` - This removes the outlier from the data. --- # (New) Data Visualization ```r *ggplot(data = house_prices, aes(x = bedrooms, y = price)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + labs(x = "Bedrooms per home", y = "Price (in $)") ``` <img src="05-Simple_Regression_files/figure-html/unnamed-chunk-8-1.png" width="45%" /> - Even after removing the outlier, there isn't a clear linear relationship between `price` and `bedrooms`. --- class: center, middle # Simple Linear Regression --- # Linear Model <img src="05-Simple_Regression_files/figure-html/unnamed-chunk-9-1.png" width="45%" /> --- # Non-linear Model <img src="05-Simple_Regression_files/figure-html/unnamed-chunk-10-1.png" width="45%" /> --- # Non-linear Model <img src="05-Simple_Regression_files/figure-html/unnamed-chunk-11-1.png" width="45%" /> --- # A (bad) Model <img src="05-Simple_Regression_files/figure-html/unnamed-chunk-12-1.png" width="45%" /> --- # Models In statistics, a **model** is a summary and simplification of data that help our understanding in many ways. A **linear model** uses sample data to generate a *line of best fit*... - ...that is used to help our understanding of the linear relationship between `\(x\)` and `\(y\)`. - Our model will be *wrong* (i.e., our line won't match reality *perfectly*). - But hopefully, it is still useful! --- # A Good Quote .pull-left[ > *All models are wrong, but some are useful.* - George Box, famous statistician ] .pull-right[ <img src="box.jpg" width="75%" /> ] --- # Simple Linear Regression Model A **simple linear regression model** follows the form of an *equation of a straight line*: `$$\hat{y}=b_{0}+b_{1}\cdot x$$` - The `\(\hat{y}\)` denotes the **predicted outcome variable**. - The **intercept coefficient** is `\(b_{0}\)`, or the value of `\(\hat{y}\)` when `\(x=0\)`. - The **slope coefficient** is `\(b_{1}\)`, or the *average* change in `\(\hat{y}\)` for every one-unit increase in `\(x\)`. --- # Regression Coefficients .pull-left[ ![](05-Simple_Regression_files/figure-html/unnamed-chunk-14-1.png)<!-- --> ] .pull-right[ Because `\(x=bedrooms\)` and `\(y=price\)`, the regression equation is `$$\widehat{price} = b_{0}+b_{1}\cdot bedrooms$$` - Do you think the slope will be *positive* or *negative*? ] --- # Regression Coefficients But what are the *specific values* of the regression coefficients, `\(b_{0}\)` and `\(b_{1}\)`? - Luckily, R can calculate these for us, by using the `lm()` function. ```r lm_house = lm(price ~ bedrooms, data = house_prices) get_regression_table(lm_house) ``` ``` ## # A tibble: 2 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 110316. 9108. 12.1 0 92462. 128169. ## 2 bedrooms 127548. 2610. 48.9 0 122432. 132664. ``` --- # The Estimated Linear Model ```r lm_house = lm(price ~ bedrooms, data = house_prices) get_regression_table(lm_house) ``` ``` ## # A tibble: 2 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 110316. 9108. 12.1 0 92462. 128169. ## 2 bedrooms 127548. 2610. 48.9 0 122432. 132664. ``` - The **intercept** coefficient is `\(b_{0}=110316\)`. - The **slope** coefficient is `\(b_{1}=127548\)`. Therefore, `$$\widehat{price} = 110316 + 127548\cdot bedrooms$$` --- # Interpreting the Regression Coefficients The **intercept** `\(b_{0}=110316\)`. - This means that the **predicted** price is <span>$</span>110,316 for homes with 0 *bedrooms*. - The intercept often doesn't make sense in context, but it does make sense here (e.g., studio apartments?). -- The **slope** `\(b_{1}=127548\)`. - This means that, for every additional `bedroom`, there is an associated increase of <span>$</span>127,548 on the **predicted** price of the home. - The slope is usually of more interest to us than the intercept. --- # Predicting House Price We can also use the equation of the linear model to **predict** the outcome (*y*) for a given value of *x*. For example, let's predict the *price* of a home with *three bedrooms*: `$$\widehat{price} = 110316 + 127548\cdot bedrooms= 110316 + 127548\cdot 3=492960$$` -- The **linear model** predicts that a house with *three bedrooms* will cost <span>$</span>492,960. --- class: center, middle # One Categorical Explanatory Variable --- # Motivating Example Do you think that **waterfront homes** are typically *more expensive* than **non-waterfront homes**? -- .pull-left[ **Waterfront Home** <img src="waterfront_seattle.jpg" width="853" /> ] .pull-right[ **Non-waterfront Home** <img src="nonwaterfront.jpg" width="733" /> ] --- # Practice Using `house_prices`, perform *all steps from the regression analysis* of **bedrooms** (*x*) and **price** (*y*), but use `waterfront` as the *x* variable *instead*. - What do you notice about how `lm()` reports information for a **categorical explanatory variable**? --- # Summary Statistics ```r house_prices %>% select(price, waterfront) %>% * group_by(waterfront) %>% summarize(mean_price = mean(price), sd_price = sd(price)) ``` ``` ## # A tibble: 2 × 3 ## waterfront mean_price sd_price ## <lgl> <dbl> <dbl> ## 1 FALSE 531559. 341607. ## 2 TRUE 1661876. 1120372. ``` --- # Data Visualization Because the *x* variable is **categorical**, a *boxplot* might be a useful visualization. ```r ggplot(data = house_prices, aes(x = waterfront, y = price)) + geom_boxplot() + labs(x = "Waterfront home?", y = "Price (in $)") ``` <img src="05-Simple_Regression_files/figure-html/unnamed-chunk-20-1.png" width="45%" /> --- # Linear Regression Model ```r lm_waterfront = lm(price ~ waterfront, data = house_prices) get_regression_table(lm_waterfront) ``` ``` ## # A tibble: 2 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 531559. 2416. 220. 0 526822. 536295. ## 2 waterfrontTRUE 1130317. 27823. 40.6 0 1075782. 1184853. ``` `$$\widehat{price}=531564+1130312\cdot waterfrontTRUE$$` -- Okay... what does `waterfrontTRUE` mean?! --- # Dummy Variables `$$\widehat{price}=531564+1130312\cdot waterfrontTRUE$$` When using a **categorical explanatory variable** in a regression model, the *estimated coefficient* corresponds to the **difference in means** between: - one level of the categorical explanatory variable, and - the *reference level* of the categorical explanatory variable (usualy the level that comes first *alphabetically*) -- Because the `waterfront` variable takes *two levels* (`TRUE` and `FALSE`), the *reference level* is `FALSE`. - `\(b_{1}=1130312\)`: Homes with a *waterfront view* are *predicted* to cost, **on average**, <span>$</span>1,130,312 more than *non-waterfront homes*. - `\(b_{0}=531564\)`: Homes *without a waterfront view* (i.e., `waterfrontTRUE = 0`) are *predicted* to cost, **on average**, <span>$</span>531,564. --- # Back to Summary Statistics ```r lm_waterfront = lm(price ~ waterfront, data = house_prices) get_regression_table(lm_waterfront) ``` ``` ## # A tibble: 2 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 531559. 2416. 220. 0 526822. 536295. ## 2 waterfrontTRUE 1130317. 27823. 40.6 0 1075782. 1184853. ``` ```r house_prices %>% group_by(waterfront) %>% summarize(mean_price = mean(price)) ``` ``` ## # A tibble: 2 × 2 ## waterfront mean_price ## <lgl> <dbl> ## 1 FALSE 531559. ## 2 TRUE 1661876. ``` **We knew the regression coefficients the whole time!!!**