class: center, middle, inverse, title-slide # Statistical Thinking using Randomisation and Simulation ## Linear models: diagnostics ### W5.C1 --- # Modeling Olympic medal counts We fit the medal count for 2016, purely on the counts from 2012, to illustrate the influence diagnostics. ``` #> term estimate std.error statistic p.value #> 1 (Intercept) 0.72 0.526 1.4 1.7e-01 #> 2 M2012 0.94 0.026 36.5 4.5e-58 ``` Giving the model, `\(M_{2016}=\)` 0.72 `\(+\)` 0.94 `\(M_{2012} + \varepsilon\)` --- class: inverse middle # Your turn - Should the model be re-fit with the intercept forced to ZERO? --
--- # Model diagnostics - Based on `leave-one-out` statistics - For `\(n\)` observations, fit `\(n\)` models where each model has one observation removed. - Let's take a look at fitting the medal tallies, without the USA. ``` #> all noUSA estimate #> 1 0.72 1.33 intercept #> 2 0.94 0.84 slope ``` -- - Parameter estimates change a little --- <img src="week5.class1_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> --- # Other model fit parameters - deviance - predicted values, residuals | | null.dev| deviance| fitted| resid| |:------|--------:|--------:|------:|-----:| |All | 29766| 2002| 98| 5.9| |No USA | 17298| 1260| 89| 15.3| --- # What it could look like <img src="week5.class1_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- # Leverage Leverage `\(h_{ii}\)` is defined for each observation, `\(1, ..., n\)`, and is the `\(i^{th}\)` diagonal element of the hat matrix: `$$H=X(X^TX)^{-1}X^T$$` where `\(X\)` is the design matrix, e.g. for `\(\beta_0+\beta_1x\)`, `$$X=\left[ \begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{array} \right]$$` Intuitively, observations which are far from the mean of the explanatory variables will have higher leverage. YOU CAN CALCULATE THIS WITHOUT FITTING ALL `\(n\)` MODELS! --- # Highest leverage for medal tally model ``` #> Country .hat #> 1 UnitedStates 0.290 #> 2 China 0.203 #> 3 RussianFed 0.175 #> 4 GreatBritain 0.106 #> 5 Germany 0.047 #> 6 Japan 0.035 #> 7 Australia 0.030 #> 8 France 0.029 #> 9 SouthKorea 0.021 #> 10 Italy 0.021 #> 11 Netherlands 0.013 #> 12 Ukraine 0.013 #> 13 Vietnam 0.013 #> 14 IvoryCoast 0.013 #> 15 IndependentOlympicAthletes 0.013 ``` Cutoff for high leverage is `\(2p/n = 2*1/73 = 0.027\)`. --- # Plot of leverage <img src="week5.class1_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> --- # Log-tranform the counts <img src="week5.class1_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> --- ``` #> Country .hat #> 1 UnitedStates 0.081 #> 2 China 0.074 #> 3 RussianFed 0.070 #> 4 GreatBritain 0.061 #> 5 Germany 0.047 #> 6 Japan 0.042 #> 7 Australia 0.040 #> 8 France 0.039 #> 9 SouthKorea 0.034 #> 10 Italy 0.034 #> 11 Vietnam 0.030 #> 12 IvoryCoast 0.030 #> 13 IndependentOlympicAthletes 0.030 #> 14 Fiji 0.030 #> 15 Jordan 0.030 ``` --- <img src="week5.class1_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> Transforming skewed variables reduces the influence of any one, or few points. The distribution is more even, and the highest leverage value is much lower now. --- # Hat values for simulated data <img src="week5.class1_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> --- # Cooks D Leverage takes no notice of the response variable. So the USA did not have a huge influence because its medal count in 2012 was similar to that in 2008, so it was close to the trend. If for some reason the medal count in 2012 was 0, the line with the USA would be much more drawn away from the other countries. Cooks D, and DFFITS, also use the response variable, to assess influence. `$$D_i = \frac{e_i^2}{{MSE}p}\frac{h_{ii}}{(1-h_{ii})^2}$$` where `\(e_i\)` is the `\(i^{th}\)` residual, `\(p=\)`number of explanatory variables, and MSE is the mean squared error of the linear model. Values greater than `\(4/n\)` are large, by a rule of thumb. Or alternatively, greater than 1 is another rule of thumb. --- # Cooks D for Olympic medal tally `$$~~~~~~~~~ Raw ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Transformed$$` ``` #> Country .cooksd Country .cooksd #> 1 UnitedStates 7.3e+00 Vietnam 4.6e-02 #> 2 RussianFed 3.1e+00 IvoryCoast 4.6e-02 #> 3 China 1.3e+00 IndependentOlympicAthletes 4.6e-02 #> 4 GreatBritain 9.5e-02 Israel 4.6e-02 #> 5 France 6.5e-02 UnitedStates 3.3e-02 #> 6 Ukraine 2.4e-02 Latvia 3.3e-02 #> 7 Japan 2.0e-02 Rep.ofMoldova 3.3e-02 #> 8 Uzbekistan 2.0e-02 Uganda 2.2e-02 #> 9 SouthKorea 1.8e-02 Botswana 2.2e-02 #> 10 Azerbaijan 1.6e-02 Cyprus 2.2e-02 #> 11 Australia 1.5e-02 Gabon 2.2e-02 #> 12 Denmark 8.6e-03 Guatemala 2.2e-02 #> 13 NewZealand 6.7e-03 Montenegro 2.2e-02 #> 14 Romania 6.6e-03 Afghanistan 2.2e-02 #> 15 Canada 5.9e-03 HongKongChina 2.2e-02 #> 16 India 4.9e-03 SaudiArabia 2.2e-02 #> 17 Kazakhstan 4.4e-03 Kuwait 2.2e-02 #> 18 Iran 4.0e-03 Uzbekistan 1.8e-02 #> 19 Croatia 3.5e-03 TrinidadTobago 1.8e-02 #> 20 SouthAfrica 3.5e-03 India 1.5e-02 #> 21 Greece 3.5e-03 Fiji 1.3e-02 #> 22 Serbia 3.5e-03 Jordan 1.3e-02 #> 23 TrinidadTobago 3.3e-03 Kosovo 1.3e-02 #> 24 Ireland 3.1e-03 Burundi 1.3e-02 #> 25 Mongolia 3.1e-03 Niger 1.3e-02 #> 26 Belarus 2.2e-03 Philippines 1.3e-02 #> 27 Cuba 2.1e-03 Austria 1.3e-02 #> 28 Latvia 2.0e-03 Nigeria 1.3e-02 #> 29 Rep.ofMoldova 2.0e-03 UnitedArabEmirates 1.3e-02 #> 30 Sweden 2.0e-03 Greece 1.3e-02 #> 31 Finland 1.8e-03 Finland 1.1e-02 #> 32 Turkey 1.8e-03 Romania 1.1e-02 #> 33 Switzerland 1.8e-03 France 1.1e-02 #> 34 Belgium 1.7e-03 Ireland 1.0e-02 #> 35 Thailand 1.7e-03 Mongolia 1.0e-02 #> 36 Malaysia 1.7e-03 Ukraine 9.9e-03 #> 37 Brazil 1.6e-03 Azerbaijan 9.7e-03 #> 38 Mexico 1.3e-03 Venezuela 9.6e-03 #> 39 Kenya 9.9e-04 GreatBritain 9.0e-03 #> 40 Uganda 8.5e-04 Malaysia 8.1e-03 #> 41 Botswana 8.5e-04 Denmark 6.5e-03 #> 42 Cyprus 8.5e-04 PuertoRico 5.9e-03 #> 43 Gabon 8.5e-04 Singapore 5.9e-03 #> 44 Guatemala 8.5e-04 Qatar 5.9e-03 #> 45 Montenegro 8.5e-04 DominicanRep. 5.9e-03 #> 46 Afghanistan 8.5e-04 Estonia 5.9e-03 #> 47 HongKongChina 8.5e-04 Serbia 5.6e-03 #> 48 SaudiArabia 8.5e-04 Japan 5.4e-03 #> 49 Kuwait 8.5e-04 NewZealand 4.8e-03 #> 50 Hungary 7.7e-04 Belgium 4.8e-03 #> 51 PuertoRico 7.6e-04 Thailand 4.8e-03 #> 52 Singapore 7.6e-04 Croatia 4.1e-03 #> 53 Qatar 7.6e-04 SouthAfrica 4.1e-03 #> 54 DominicanRep. 7.6e-04 Canada 3.9e-03 #> 55 Estonia 7.6e-04 Kazakhstan 3.5e-03 #> 56 Italy 5.9e-04 Switzerland 3.5e-03 #> 57 Venezuela 5.6e-04 Iran 3.0e-03 #> 58 Vietnam 5.3e-04 Turkey 2.8e-03 #> 59 IvoryCoast 5.3e-04 RussianFed 2.6e-03 #> 60 IndependentOlympicAthletes 5.3e-04 Sweden 2.3e-03 #> 61 Israel 5.3e-04 Bahrain 2.2e-03 #> 62 Lithuania 5.3e-04 Bahamas 2.2e-03 #> 63 Jamaica 2.3e-04 Algeria 2.2e-03 #> 64 Poland 2.1e-04 Germany 2.1e-03 #> 65 Ethiopia 1.4e-04 Brazil 1.7e-03 #> 66 Tajikistan 1.3e-04 Mexico 1.7e-03 #> 67 Grenada 1.3e-04 Italy 1.4e-03 #> 68 Morocco 1.3e-04 Belarus 1.3e-03 #> 69 Portugal 1.3e-04 Kenya 1.3e-03 #> 70 NorthKorea 1.2e-04 SouthKorea 1.1e-03 #> 71 Tunisia 8.0e-05 Lithuania 8.2e-04 #> 72 Netherlands 6.4e-05 Indonesia 8.2e-04 #> 73 Armenia 6.4e-05 ChineseTaipei 8.2e-04 #> 74 Argentina 6.0e-05 Bulgaria 8.2e-04 #> 75 Slovakia 6.0e-05 Egypt 8.2e-04 #> 76 Slovenia 6.0e-05 Cuba 7.9e-04 #> 77 Norway 6.0e-05 Tajikistan 6.5e-04 #> 78 Indonesia 4.9e-05 Grenada 6.5e-04 #> 79 ChineseTaipei 4.9e-05 Morocco 6.5e-04 #> 80 Bulgaria 4.9e-05 Portugal 6.5e-04 #> 81 Egypt 4.9e-05 Armenia 4.8e-04 #> 82 Spain 3.9e-05 Poland 4.7e-04 #> 83 Bahrain 3.7e-05 Spain 3.9e-04 #> 84 Bahamas 3.7e-05 Ethiopia 3.5e-04 #> 85 Algeria 3.7e-05 NorthKorea 3.3e-04 #> 86 Fiji 2.5e-05 China 2.9e-04 #> 87 Jordan 2.5e-05 Netherlands 2.0e-04 #> 88 Kosovo 2.5e-05 Tunisia 9.1e-05 #> 89 Burundi 2.5e-05 CzechRepublic 4.9e-05 #> 90 Niger 2.5e-05 Argentina 3.6e-05 #> 91 Philippines 2.5e-05 Slovakia 3.6e-05 #> 92 Austria 2.5e-05 Slovenia 3.6e-05 #> 93 Nigeria 2.5e-05 Norway 3.6e-05 #> 94 UnitedArabEmirates 2.5e-05 Hungary 2.1e-05 #> 95 Georgia 1.9e-05 Jamaica 1.3e-05 #> 96 Germany 1.2e-05 Australia 1.0e-05 #> 97 Colombia 1.1e-05 Colombia 9.8e-06 #> 98 CzechRepublic 1.6e-06 Georgia 1.2e-06 ``` --- <img src="week5.class1_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> --- # Cooks D for simulated data <img src="week5.class1_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> Values are more spread, when the one extreme value is removed. No other points are influential. --- # Solutions - Remove influential observations, and re-fit model - Transform explanatory variables to reduce influence - Use weighted regression to downweight influence of extreme observations --- class: inverse middle # Your turn - What happens when there are two extreme points with virtually the same values? --- # Collinearity Population and GDP are standardised. ``` #> term estimate std.error statistic p.value #> 1 (Intercept) 1.8604 0.49070 3.8 2.9e-04 #> 2 Total_2012 0.7471 0.04108 18.2 1.6e-30 #> 3 Population_mil -0.0260 0.00384 -6.8 1.7e-09 #> 4 GDP_PPP_bil 0.0024 0.00038 6.4 8.4e-09 ``` Giving the model `\(M2016=\)` 1.86 `\(+\)` 0.75 `\(M2012+\)` -0.03 `\(Pop+\)` 0 `\(GDP+\varepsilon\)` --- # Plot the explanatory variables <img src="week5.class1_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> --- # Explore countries
--- # Variance inflation factor (VIF) `$$\frac{1}{1-R_j^2}$$` where `\(R_j^2\)` is computed by regressing variable `\(j\)` on all other variables. VIF is a measure the collinearity of the explanatory variables. Values greater than 10 are considered to be high. These are the VIFs for the olympic medal tally data: ``` #> Total_2012 Population_mil GDP_PPP_bil #> 3.6 3.5 7.6 ``` --- # Suppose we add 2008 counts as an explanatory variable <img src="week5.class1_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> --- # Model ``` #> term estimate std.error statistic p.value #> 1 (Intercept) 2.2413 0.57642 3.9 2.3e-04 #> 2 Total_2012 0.9080 0.10146 8.9 3.7e-13 #> 3 Total_2008 -0.2025 0.10785 -1.9 6.5e-02 #> 4 Population_mil -0.0278 0.00410 -6.8 3.3e-09 #> 5 GDP_PPP_bil 0.0028 0.00043 6.5 1.1e-08 ``` Giving the model `\(M2016=\)` 2.24 `\(+\)` 0.91 `\(2012+\)` -0.2 `\(M2008+\)` -0.03 `\(Pop+\)` 0 `\(GDP+\varepsilon\)` --- # VIFs ``` #> Total_2012 Total_2008 Population_mil GDP_PPP_bil #> 18.7 21.7 3.6 8.5 ``` Notice that the VIFs for both 2008 and 2012 are high. --- class: inverse middle # Your turn - Why is it called `Variance Inflation Factor`? Look at the standard deviation of the estimates for the model with 2012 and without 2004. - Why would multicollinearity inflate variance of estimates? --- # Solutions - Drop some variables - Use principal component regression (more advanced courses) - Partial regression: Fit best variable. Regress next explanatory variable first variable and use the residuals from this fit as the second variable in the model. Continue with other variables. --- # Resources - [Regression Diagnostics: Identifying Influential Data and Sources of Collinearity](http://onlinelibrary.wiley.com/book/10.1002/0471725153) --- class: inverse middle # Share and share alike <a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.