class: center, middle, inverse, title-slide # Statistical Thinking using Randomisation and Simulation ## Linear models: model choice ### W5.C2 --- # Data Olympic medal tallies: - Response: M2016 - Explanatory variables: M2012, M2008, Population_mil, GDP_PPP_bil --- class: inverse middle # Your turn What are the possible models? How many are there? -- 15 --- --- # Check the data <img src="week5.class2_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> --- # Log transform everything <img src="week5.class2_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> --- # Fit all possible models ```r library(meifly) oly_gdp_2016_12_08_models <- fitall(oly_gdp_2016_12_08$M2016, oly_gdp_2016_12_08[,3:6], "lm") ``` ```r summary(oly_gdp_2016_12_08_models[[1]]) #> #> Call: #> lm(formula = y ~ Population_mil, data = data, model = FALSE) #> #> Residuals: #> Min 1Q Median 3Q Max #> -0.9088 -0.3030 0.0076 0.2880 0.8769 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 0.4600 0.0841 5.47 4.5e-07 *** #> Population_mil 0.2989 0.0596 5.02 2.9e-06 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 0.4 on 84 degrees of freedom #> Multiple R-squared: 0.231, Adjusted R-squared: 0.222 #> F-statistic: 25.2 on 1 and 84 DF, p-value: 2.87e-06 ``` --- # Second model fit ```r summary(oly_gdp_2016_12_08_models[[2]]) #> #> Call: #> lm(formula = y ~ GDP_PPP_bil, data = data, model = FALSE) #> #> Residuals: #> Min 1Q Median 3Q Max #> -0.8247 -0.2991 0.0087 0.3204 0.6898 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) -0.00934 0.13317 -0.07 0.94 #> GDP_PPP_bil 0.33608 0.05133 6.55 4.4e-09 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 0.37 on 84 degrees of freedom #> Multiple R-squared: 0.338, Adjusted R-squared: 0.33 #> F-statistic: 42.9 on 1 and 84 DF, p-value: 4.4e-09 ``` --- # Model fit summaries - Extract the model fit statistics, adjusted `\(R^2\)`, `\(R^2\)`, AIC, BIC, for each model, and display these against the degrees of freedom - Flip AIC and BIC, so that the direction matches other statistics - Helps choose best model - How different is the best model from the next best model --- <img src="week5.class2_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> --- # What do we learn? - Not a lot of improvement by adding more variables to the model with one variable - Maybe gain a small amount with two variables - There is a big difference between the best and worst model --- # Interactive
--- # What do we learn - The best single variable model uses M2012. The next best single variable model uses M2008. - The best two variable model uses M2012 and Population_mil, but there is very little difference between the next two models. ---
---
--- # Coefficients for each model - Extract the parameter estimates for each model - Plot these against the explanatory variable name - Connect values corresponding to the same model with lines --- <img src="week5.class2_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> ---
--- # What do we learn - If M2012 is in the model, its coefficient is big and coefficients for other variables are really small. - M2008 has a high coefficient, if M2012 is not in the model - it substitutes for M2012. - GDP_PPP_bil and Population_mil only have high coefficients when M2012 and M2008 are not in the model. --- # Balance complexity vs accuracy - Choose the simplest model that explains almost as much as a more complex model. - Choosing here between a single variable model using M2012, and adding Population_mil. --- # Model A ``` #> #> Call: #> lm(formula = M2016 ~ M2012, data = oly_gdp_2016_12_08) #> #> Residuals: #> Min 1Q Median 3Q Max #> -0.4697 -0.0876 0.0238 0.1176 0.3754 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 0.1981 0.0346 5.72 1.6e-07 *** #> M2012 0.8193 0.0378 21.66 < 2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 0.18 on 84 degrees of freedom #> Multiple R-squared: 0.848, Adjusted R-squared: 0.846 #> F-statistic: 469 on 1 and 84 DF, p-value: <2e-16 ``` --- # Model B ``` #> #> Call: #> lm(formula = M2016 ~ M2012 + Population_mil, data = oly_gdp_2016_12_08) #> #> Residuals: #> Min 1Q Median 3Q Max #> -0.5442 -0.0817 0.0158 0.1190 0.3546 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 0.1442 0.0398 3.63 0.0005 *** #> M2012 0.7766 0.0404 19.23 <2e-16 *** #> Population_mil 0.0713 0.0283 2.52 0.0136 * #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 0.17 on 83 degrees of freedom #> Multiple R-squared: 0.859, Adjusted R-squared: 0.856 #> F-statistic: 253 on 2 and 83 DF, p-value: <2e-16 ``` --- # Model comparison ``` #> Analysis of Variance Table #> #> Model 1: M2016 ~ M2012 #> Model 2: M2016 ~ M2012 + Population_mil #> Res.Df RSS Df Sum of Sq F Pr(>F) #> 1 84 2.65 #> 2 83 2.46 1 0.189 6.36 0.014 * #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- class: inverse middle # Your turn How much more explanatory power do we get by including Population_mil? --- # Diagnostics <img src="week5.class2_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> --- # Diagnostics <img src="week5.class2_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> Hmm, do we have a problem with one observation? ---
--- # VIFs ``` #> M2012 Population_mil #> 1.2 1.2 ``` <img src="week5.class2_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> --- # Final model - `\(log10(M2016+1)=\)` 0.14 `\(+\)` 0.78 `\(log10(M2012+1)+\)` 0.07 `\(log10(Population.mil)+\varepsilon\)` - Predict the medal count for Australia `\(log10(M2012+1)=1.56\)` and ( `\(log10(Population.mil)=1.358\)` ). How many medals is this? And what is the population? - 0.14 `\(+\)` 0.78 `\(\times 1.56+\)` 0.07 `\(\times 1.358\)` = 1.45 - The observed count for 2016 was 29 medals. What is the estimated medal count? Compute the residual. --- # How does the model look?
--- # Resources - [Statistics online textbook, Diez, Barr, Cetinkaya-Rundel](https://www.openintro.org/stat/textbook.php?stat_book=isrs) --- class: inverse middle # Share and share alike <a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.