class: center, middle, inverse, title-slide

# Statistical Thinking using Randomisation and Simulation
## Bootstrap, Permutation and Linear Models
### Di Cook (dicook@monash.edu, @visnut)
### W6.C1

---

# Overview of this class

- Review of t-tests, confidence intervals and prediction intervals
- Review of bootstrap and permutation
- Application to linear models

---

# Recall the Olympics model

Counts on the log scale

```
#>          term estimate std.error statistic   p.value
#> 1 (Intercept)   0.1004   0.04816     2.086 4.062e-02
#> 2       M2008   0.9010   0.04910    18.350 1.166e-28
```

Model is `\(\log_{10}(M2012+1)=\)` 0.1004 `\(+\)` 0.901 `\(\log_{10}(M2008+1)+\varepsilon\)`.

---
class: inverse middle

# Your turn

Write down the formula that was used to get the test statistic for the slope parameter.

---

<img src="week6.class1_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

# Answer

`$$\frac{b_1}{SE(b_1)}$$`

where

`$$SE(b_1)=\sqrt{\frac{MSE}{\sum_{i=1}^n(x_i - \bar{x})^2}}$$`

and

`$$MSE=\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n - 2}$$`

Check the numbers in the table.

---

# t-test

`$$H_o: \beta_1 = 0~~vs~~H_a: \beta_1 \neq 0$$`

<img src="week6.class1_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

Decision: the p-value is very small (twice the area to the right of the red line), so reject `\(H_o\)`.

Conclusion: The slope parameter for the regression model using the entire population is not 0.

---

# Confidence interval for slope

`$$b_1 \pm t_{\alpha/2, n-2} SE(b_1)$$`

For `\(\alpha=0.05\)`, yielding a 95% confidence level, `\(n=\)` 73, `\(t_{\alpha/2, n-2}=\)` 1.9939,

0.901 `\(\pm\)` 1.9939 `\(\times\)` 0.0491 = (0.8031, 0.9989)

Explanation: We are 95% confident that the slope of a regression model fitted to the entire population is between 0.8 and 1.0.
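The interval above can be checked directly in R; a minimal sketch using only the numbers reported in the summary table (`oly_lm` in the final comment is a hypothetical name for the fitted model object):

```r
b1     <- 0.9010  # slope estimate from the summary table
se_b1  <- 0.0491  # its standard error
n      <- 73      # number of countries used in the fit
t_crit <- qt(0.975, df = n - 2)  # critical value, 1.9939
round(c(b1 - t_crit * se_b1, b1 + t_crit * se_b1), 4)
#> [1] 0.8031 0.9989
# With the fitted model object this is one call: confint(oly_lm, level = 0.95)
```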
---

# Confidence interval for predicted value

For a given value of `\(x\)`,

`\(\hat{y} \pm t_{\alpha/2, n-2}\sqrt{MSE(\frac{1}{n}+\frac{(x-\bar{X})^2}{\sum_{i=1}^n(X_i-\bar{X})^2})}\)`

<img src="week6.class1_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

# Prediction interval for NEW value

For a given value of `\(x\)`,

`\(\hat{y} \pm t_{\alpha/2, n-2}\sqrt{MSE(1+\frac{1}{n}+\frac{(x-\bar{X})^2}{\sum_{i=1}^n(X_i-\bar{X})^2})}\)`

MSE from the model fit is 0.1838.

<img src="week6.class1_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

# Computational approach

- Hypothesis tests can be conducted using permutation
- Confidence and prediction intervals can be generated using the bootstrap
- WHY??? Classical methods have strict assumptions about the distribution of the errors; computational approaches relax these assumptions

---

# Permutation hypothesis tests

For regression, to test `\(H_o\)`, one of the two columns is permuted to break the association.

```r
df <- data.frame(x=letters[1:5], y=letters[1:5])
head(data.frame(df, py=sample(df$y)))
#>   x y py
#> 1 a a  e
#> 2 b b  b
#> 3 c c  c
#> 4 d d  d
#> 5 e e  a
```

Make many more permutation sets.

---

```r
p1 <- ggplot(oly, aes(x=M2008, y=M2012)) + geom_point() +
  ggtitle("Original data")
p2 <- ggplot(oly, aes(x=M2008, y=sample(M2012))) + geom_point() +
  ggtitle("Permuted response")
grid.arrange(p1, p2, ncol=2)
```

<img src="week6.class1_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

# Permutation distribution of intercept and slope

<img src="week6.class1_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

Red lines indicate values from our data, which are far from the values obtained from the permuted data.

---

# Statistical significance

- Permutation gives us samples consistent with `\(H_o: \beta_1=0\)`, whilst keeping the marginal distributions of X and Y the same.
- In the example we see that the values from the permuted data center on 0. We are seeing what the variation in `\(b_1\)` might be, from one sample to another, if the parameter `\(\beta_1\)` (the slope computed for the whole population) is actually 0.
- To compute the p-value, count the number of values computed on the permuted data that are more extreme than the values from the actual data.
- In this example, the p-value is 0 for both intercept and slope.

---

# Confidence intervals via bootstrap

1. Make N bootstrap samples (sample data rows, with replacement)
2. Fit the model for each
3. Compute lower and upper C% bounds, by sorting the values and pulling the relevant ones, e.g. if N=1000, C=95, we would take the `\(25^{th}\)` and `\(975^{th}\)` values as the lower and upper CI bounds

---

# Bootstrap samples

```r
orig <- letters[1:10]
orig
#> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
boot1 <- sort(sample(orig, replace=TRUE))
boot1
#> [1] "a" "b" "b" "g" "g" "h" "h" "j" "j" "j"
```

---

# Bootstrap confidence interval for the slope

<img src="week6.class1_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

- Intercept: `\((-2.314\times 10^{-4}, 0.2131)\)`
- Slope: (0.8057, 0.982)

---

# Compare intervals

```
#>       label      l      u
#> 1 classical 0.8031 0.9989
#> 2 bootstrap 0.8057 0.9820
```

<img src="week6.class1_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

# Bootstrap confidence intervals for predicted value

<img src="week6.class1_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

# Bootstrap prediction intervals for NEW values

The procedure derives from bootstrapping the residuals.
1. Compute the residuals from the fitted model
2. Bootstrap the residuals
3. Find the desired quantiles of the residuals
4. Compute prediction intervals by adding the residual quantiles to the fitted value

---

<img src="week6.class1_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

# Example: 2000 US Elections

![](butterfly_ballot.jpg)

---

# Example: Confusing?

![](palmballot.jpg)

---

<img src="week6.class1_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

```
#> Observations: 67
#> Variables: 17
#> $ County             <chr> "ALACHUA", "BAKER", "BAY", "BRADFORD", "BRE...
#> $ Palm_Beach         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ Population         <int> 198326, 20761, 146223, 24646, 460977, 14707...
#> $ Log_Population     <dbl> 12.198, 9.941, 11.893, 10.112, 13.041, 14.2...
#> $ Black              <dbl> 21.8, 16.8, 12.4, 22.9, 9.2, 17.5, 16.9, 4....
#> $ Hispanic           <dbl> 4.7, 1.5, 2.4, 2.6, 4.1, 10.9, 1.6, 3.4, 2....
#> $ age 65 (%)         <dbl> 9.428, 7.697, 11.882, 11.819, 16.462, 20.32...
#> $ college            <dbl> 34.6, 5.7, 15.7, 8.1, 20.4, 18.8, 8.2, 13.4...
#> $ Income (Thousands) <dbl> 26.60, 27.61, 26.85, 25.28, 33.28, 31.26, 2...
#> $ Income (Dollars)   <int> 26597, 27614, 26846, 25277, 33284, 31264, 2...
#> $ Age 65 (total)     <int> 18698, 1598, 17374, 2913, 75888, 298900, 17...
#> $ Gore               <int> 47365, 2392, 18850, 3075, 97318, 386561, 21...
#> $ Bush               <int> 34124, 5610, 38637, 5414, 115185, 177323, 2...
#> $ Buchanan           <int> 262, 73, 248, 65, 570, 789, 90, 182, 270, 1...
#> $ Nader              <int> 3215, 53, 828, 84, 4470, 7099, 39, 1462, 13...
#> $ Total Votes        <int> 84966, 8128, 58563, 8638, 217543, 571772, 5...
#> $ Log_Buchanan       <dbl> -1.1765, -0.1074, -0.8593, -0.2844, -1.3393...
```

---

# Fit model, without Palm Beach

```
#>                   term estimate std.error statistic   p.value
#> 1          (Intercept)  2.14650  0.395463     5.428 1.083e-06
#> 2         `age 65 (%)` -0.04153  0.006992    -5.939 1.551e-07
#> 3                Black -0.01325  0.004594    -2.884 5.450e-03
#> 4             Hispanic -0.03495  0.005135    -6.807 5.329e-09
#> 5              college -0.01933  0.009709    -1.991 5.100e-02
#> 6 `Income (Thousands)` -0.06580  0.014363    -4.582 2.385e-05
```

---

# Predictors

<img src="week6.class1_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---
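# Sketch: bootstrap interval for a slope

The bootstrap interval procedure described earlier applies directly to a fit like this one. A minimal R sketch on simulated stand-in data (the `florida` data frame is not reproduced on this slide; with the real data, resample its rows and refit the model above):

```r
set.seed(2020)
# Stand-in data: in the real analysis this would be the florida counties
d <- data.frame(x = runif(67))
d$y <- 2 - 0.5 * d$x + rnorm(67, sd = 0.2)

N <- 1000
slopes <- replicate(N, {
  b <- d[sample(nrow(d), replace = TRUE), ]  # 1. sample rows with replacement
  coef(lm(y ~ x, data = b))[["x"]]           # 2. refit, keep the slope
})
quantile(slopes, c(0.025, 0.975))            # 3. pull the 2.5% and 97.5% values
```

---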
# Check model

<img src="week6.class1_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

---
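# Sketch: residual-bootstrap prediction interval

The residual-bootstrap steps listed earlier give a quick way to attach a prediction interval to a fit like this. A minimal R sketch on simulated stand-in data (with the real analysis, `fit` would be the model without Palm Beach):

```r
set.seed(2020)
# Stand-in data and model: substitute the real data and fit here
d <- data.frame(x = runif(67))
d$y <- 2 - 0.5 * d$x + rnorm(67, sd = 0.2)
fit <- lm(y ~ x, data = d)

# 1. compute the residuals from the fitted model
e <- residuals(fit)
# 2.-3. bootstrap the residuals, take the 2.5% and 97.5% quantiles
qs <- quantile(sample(e, 1000, replace = TRUE), c(0.025, 0.975))
# 4. add the residual quantiles to the fitted value at a new x
predict(fit, data.frame(x = 0.5)) + qs
```

---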
# Predict Palm Beach

```r
pb <- florida %>% filter(County=="PALM BEACH")
pb_p <- predict(florida_lm, pb)
pb_e <- pb$Log_Buchanan - pb_p
cbind(pb$Log_Buchanan, pb_p, pb_e)
#>             pb_p  pb_e
#> 1 -0.2365 -2.003 1.766
```

---

# Bootstrap confidence for predictions

<img src="week6.class1_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />

---

# Bootstrap prediction intervals

<img src="week6.class1_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />

---

# Summary

- The number of votes for Buchanan in Palm Beach County was much higher than could be expected given the demographic composition of the location.
- This is evidence that the butterfly ballot may have caused some confusion, and errors in voting intention.

---

# Resources

- [Statistics online textbook, Diez, Barr, Cetinkaya-Rundel](https://www.openintro.org/stat/textbook.php?stat_book=isrs)
- [Nice example for automotive costs](http://fmwww.bc.edu/cef00/papers/paper71.pdf)
- [2000 US Election Florida undercount](http://www.ssc.wisc.edu/~bhansen/vote/florida.pdf)

---
class: inverse middle

# Share and share alike

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.