class: center, middle, inverse, title-slide

# Statistical Thinking using Randomisation and Simulation
## Linear models
### Di Cook (dicook@monash.edu, @visnut)
### W4.C2

---

# Overview of this class

- Fitting a linear model to the Olympic medal tally
- Review of linear regression

---

# Modeling Olympic medal counts

How does the medal count in 2016 associate with that from the previous Olympics, and with the country's population and GDP?

![](week4.class2_files/figure-html/unnamed-chunk-1-1.png)

---

# Model fit summary

`$$M_{2016} = \beta_0 + \beta_1 M_{2012} + \beta_2 Population + \beta_3 GDP + \varepsilon$$`

```
#>             term estimate std.error statistic p.value
#> 1    (Intercept)   1.8604   0.49070       3.8 2.9e-04
#> 2     Total_2012   0.7471   0.04108      18.2 1.6e-30
#> 3 Population_mil  -0.0260   0.00384      -6.8 1.7e-09
#> 4    GDP_PPP_bil   0.0024   0.00038       6.4 8.4e-09
```

```
#>   null.deviance df.null logLik AIC BIC deviance df.residual
#> 1         28518      85   -235 480 492     1192          82
```

---

# Fit and residuals

<img src="week4.class2_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

# Make plots interactive
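---

# Fitting the model in code

A minimal sketch of how a multiple regression like the medal-count model above can be fit by ordinary least squares. The data here are synthetic (all variable names and values are illustrative, not the course data), so the estimates will only roughly recover the coefficients used to simulate the response:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 86  # the fit summary above reports 86 countries

# Synthetic explanatory variables (illustrative values only)
m2012 = rng.poisson(10, n).astype(float)   # medal tally in 2012
pop = rng.uniform(1, 200, n)               # population, in millions
gdp = rng.uniform(10, 2000, n)             # GDP (PPP), in billions

# Simulate a response from known coefficients plus noise
y = 1.9 + 0.75 * m2012 - 0.03 * pop + 0.002 * gdp + rng.normal(0, 2, n)

# Design matrix with an intercept column; ordinary least squares fit
X = np.column_stack([np.ones(n), m2012, pop, gdp])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # estimates of (intercept, M2012, population, GDP) effects
```

The course summary also reports deviance and log-likelihood, so the original fit may have used a generalised linear model; the least-squares version here matches the model equation on the summary slide.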
---

# Simple linear model

The model:

`$$Y = \beta_0 + \beta_1 X + \varepsilon$$`

- Explains how the response variable, `\(Y\)`, changes in relation to the explanatory variable, `\(X\)`, on average.
- Use the line to predict the value of `\(Y\)` for a given value of `\(X\)`.

<img src="week4.class2_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

# Observed, fitted, residuals

- Observed value is `\(Y\)` (a point on the plot)
- Fitted value is `\(\hat{Y}\)`, a value that lies on the line
- Residual is the difference between observed and fitted, `\(e = Y - \hat{Y}\)`

![](regression.png)

---

# Fitting process

- Minimizing the sum of squared residuals produces the best-fitting line.
- Minimizes `\(\sum e^2\)`
- The line that is closest to the points, as a whole.

---

# Parameter interpretation

- Line of best fit: `\(\hat{Y} = b_0 + b_1 X\)`
- `\(b_0\)` is the intercept of the line with the y-axis
- `\(b_1\)` is the slope of the line

---

# Calculating manually

Given the standard deviation of `\(X\)`, `\(s_x\)`, the standard deviation of `\(Y\)`, `\(s_y\)`, and the correlation, `\(r\)`, between the two, the slope is computed by

`$$b_1 = r\frac{s_y}{s_x}$$`

and given the sample means `\(\bar{X}, \bar{Y}\)`,

`$$b_0 = \bar{Y} - b_1\bar{X}$$`

---

class: inverse middle

# YOUR TURN

(Complete questions online)

- Is the point `\((\bar{X}, \bar{Y})\)` on the regression line?

---

# Prediction

For given `\(X\)` values, plug these into the model equation to predict `\(Y\)`:

`$$\hat{Y} = b_0 + b_1 X$$`

---

# Goodness of fit

- `\(R^2\)` is the proportion of variation in `\(Y\)` that is explained by `\(X\)`. Computed by

`$$R^2 = 1 - \frac{\sum e^2}{\sum (Y - \bar{Y})^2}$$`

- __Deviance__: up to a constant, minus twice the maximized log-likelihood. It is the modern analogue of the residual sum of squares, and measures the relative merits of two models.
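---

# Checking the formulas in code

The manual formulas above — `\(b_1 = r\,s_y/s_x\)`, `\(b_0 = \bar{Y} - b_1\bar{X}\)`, and the `\(R^2\)` definition — can be verified numerically. A sketch on made-up data:

```python
import numpy as np

# Made-up data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]              # correlation between X and Y
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope: b1 = r * s_y / s_x
b0 = y.mean() - b1 * x.mean()            # intercept: b0 = ybar - b1 * xbar

yhat = b0 + b1 * x                       # fitted values
e = y - yhat                             # residuals
r2 = 1 - np.sum(e**2) / np.sum((y - y.mean())**2)

# The point (xbar, ybar) always lies on the fitted line,
# and for simple regression R^2 equals r^2.
print(b0, b1, r2)
```

This also answers the YOUR TURN question: since `\(b_0 = \bar{Y} - b_1\bar{X}\)`, plugging `\(\bar{X}\)` into the line gives exactly `\(\bar{Y}\)`.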
---

# Reading residual plots

- Make a histogram and a normal probability plot of the residuals
    - for a good fit, the shape should be roughly symmetric and bell-shaped
- Plot the residuals against the fitted values
    - for a good fit, this should be just a random scatter, with no patterns

---

# Residual plots

![](residuals.png)

---

# More diagnostics

- __Influential points__: leverage (the diagonal elements of the hat matrix; values `\(> 2p/n\)` indicate cases with high leverage), and Cook's distance (`cooksd`, which measures the change in the fit when the case is removed)
- __Collinearity__ between explanatory variables (multiple regression): variance inflation factor

![](week4.class2_files/figure-html/unnamed-chunk-8-1.png)

---

# Cautions

- Association is not causation
- Linear association only
- Extrapolation outside the range of the data is not recommended

---

# Anscombe's quartet

![](anscombe.png)

Always plot the data, because very different patterns can lead to the same correlation.

---

# Resources

- [Statistics online textbook, Diez, Barr, Cetinkaya-Rundel](https://www.openintro.org/stat/textbook.php?stat_book=isrs)
- [Anscombe's quartet](http://en.wikipedia.org/wiki/Anscombe's_quartet)

---

class: inverse middle

# Share and share alike

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.
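---

# Appendix: diagnostics in code

The leverage and Cook's distance diagnostics described earlier can be computed directly from the hat matrix. A minimal sketch on simulated data (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 2                           # 30 cases; intercept + one predictor
x = rng.uniform(0, 10, n)
y = 3 + 0.5 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
h = np.diag(H)                         # leverage: its diagonal elements
flagged = h > 2 * p / n                # rule-of-thumb cutoff from the slide

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                       # residuals
s2 = np.sum(e**2) / (n - p)            # residual variance estimate
cooksd = e**2 / (p * s2) * h / (1 - h)**2   # Cook's distance per case

print(flagged.sum(), cooksd.round(3))
```

A handy check: the leverages always sum to `\(p\)`, the number of model parameters, so the `\(2p/n\)` cutoff is twice the average leverage.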