class: center, middle, inverse, title-slide

# Statistical Thinking using Randomisation and Simulation
## Ensemble models using bootstrap
### Di Cook (dicook@monash.edu, @visnut)
### W8.C1

---
# Bootstrapping

- Taking bootstrap samples, fitting a model to each sample, and combining the model predictions improves predictive accuracy, by reducing the variance of the predictions
- The most famous example is the random forest
- A forest is constructed by taking a bootstrap sample of the cases, and a random sample of the available explanatory variables, fitting a tree to each, then pooling the model predictions

---
# Example - same as for trees

<img src="week8.class1_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---
# Bootstrap samples

<img src="week8.class1_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---
.pull-left[
Sample 1

<img src="week8.class1_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]
.pull-right[
Sample 2

<img src="week8.class1_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]

---
# Combine predictions
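The slides don't show how the two trees were fit. A minimal sketch, assuming simple row resampling with `rpart` (the seed and the resampling code are assumptions; `df`, `df1_rp` and `df2_rp` are the names used in the chunk below):

```r
library(rpart)

# Fit one regression tree to each bootstrap sample -
# a sketch, not the exact code behind these slides
set.seed(444)
df1 <- df[sample(nrow(df), replace = TRUE), ]  # bootstrap sample 1
df2 <- df[sample(nrow(df), replace = TRUE), ]  # bootstrap sample 2
df1_rp <- rpart(y ~ x, data = df1)
df2_rp <- rpart(y ~ x, data = df2)
```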
```r
df_rp_aug <- cbind(df,
                   p1=predict(df1_rp, df),  # predictions from tree on sample 1
                   p2=predict(df2_rp, df))  # predictions from tree on sample 2
df_rp_aug[91:100,]
```

```
##         x       y     p1      p2
## 91  0.353 -0.0257 -0.163 -0.0923
## 92  0.357 -0.1950 -0.163 -0.0923
## 93  0.360 -0.2569 -0.163 -0.0923
## 94  0.360  0.1136 -0.163 -0.0923
## 95  0.402 -0.1063 -0.163 -0.0923
## 96  0.406 -0.1189 -0.163 -0.2045
## 97  0.413 -0.3206 -0.163 -0.2045
## 98  0.440 -0.4162 -0.163 -0.2045
## 99  0.455 -0.1108 -0.163 -0.2045
## 100 0.468 -0.1376 -0.163 -0.2045
```

---
.pull-left[
```r
df_rp_aug <- df_rp_aug %>%
  mutate(p=(p1+p2)/2)  # ensemble prediction: average the two trees
df_rp_aug[91:100,]
```

```
##         x       y     p1      p2      p
## 91  0.353 -0.0257 -0.163 -0.0923 -0.128
## 92  0.357 -0.1950 -0.163 -0.0923 -0.128
## 93  0.360 -0.2569 -0.163 -0.0923 -0.128
## 94  0.360  0.1136 -0.163 -0.0923 -0.128
## 95  0.402 -0.1063 -0.163 -0.0923 -0.128
## 96  0.406 -0.1189 -0.163 -0.2045 -0.184
## 97  0.413 -0.3206 -0.163 -0.2045 -0.184
## 98  0.440 -0.4162 -0.163 -0.2045 -0.184
## 99  0.455 -0.1108 -0.163 -0.2045 -0.184
## 100 0.468 -0.1376 -0.163 -0.2045 -0.184
```
]
.pull-right[
<img src="week8.class1_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />
]

---
# Compare errors

.pull-left[
Tree model on full data

```r
df_rp <- rpart(y~x, df)
df_rp_aug <- cbind(df_rp_aug, full_p=predict(df_rp, df))
MSE(df_rp_aug$y, df_rp_aug$full_p)  # mean squared error
```

```
## [1] 0.010066
```
]
.pull-right[
Average of two models

```r
MSE(df_rp_aug$y, df_rp_aug$p)
```

```
## [1] 0.010045
```
]

Averaging reduces the error (a little)!

---
# Add more trees

.pull-left[
MSE from averaging 5 trees

```
## [1] 0.0094727
```

reduces the error by quite a bit more (one way to compute this is sketched below).
]
.pull-right[
<img src="week8.class1_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
]
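As a guide, the five-tree average could be computed with a loop like this. It is a sketch of the same resample-fit-average recipe as above, not the code behind the slide; `B`, `preds` and the seed are assumed names and values, so it will not reproduce the exact MSE shown.

```r
# Average predictions from trees grown on B bootstrap samples
set.seed(444)
B <- 5
preds <- matrix(NA, nrow(df), B)   # one column of predictions per tree
for (b in 1:B) {
  boot <- df[sample(nrow(df), replace = TRUE), ]
  preds[, b] <- predict(rpart(y ~ x, data = boot), df)
}
MSE(df$y, rowMeans(preds))         # error of the averaged predictions
```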
---
# Random forests

- Multiple trees, fit to bootstrap samples
- Sample cases, using bootstrapping (the cases not chosen are called "out-of-bag", and are used for testing purposes)
- Sample the variables available to choose from at each node of each tree
- Lots of diagnostics generated!
- Bagging stands for `bootstrap aggregation`: combine the results from multiple models built on different bootstrap samples. Random forests are an example of bagging
- Bagging can be used with almost any classifier
- Bagging reduces variation in estimates

---
# Example: Tennis performance

Women's grand slam performance statistics for 2012.

Response variable: `won`, whether the player won the match or not

Explanatory variables: `Winners`, `Unforced.Errors`, `First.Serves.In`, `Total.Aces`, `Total.Double.Faults`, `First.Serve.Pts.Won`, `Receiving.Points.Won`, `Break.Point.Conversions`

Split the data into two pieces, training and test. The model will be fit on the training set, and the error will be calculated on the test set.

```r
library(caret)
tennis <- read_csv("../data/tennis_2012_wta_fixed.csv")
indx <- createDataPartition(tennis$won, p=0.67, list=FALSE)  # 67% training
train <- tennis[indx,]
test <- tennis[-indx,]
#train %>% count(won)
#test %>% count(won)
```

---
# Random forest fit to training
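The fitting chunk is not shown on the slide. Reconstructed from the `Call` in the printed output below, it would look something like this (the object name `tennis_rf` is an assumption; note the formula also uses the derived variables `RPW.TPW`, `W.minus.Aces` and `SPW.TPW`, which were not in the list on the previous slide):

```r
library(randomForest)

# Reconstructed from the Call shown in the output below
tennis_rf <- randomForest(won ~ Winners + Unforced.Errors + RPW.TPW +
                            W.minus.Aces + SPW.TPW + First.Serves.In +
                            Total.Aces + Total.Double.Faults +
                            First.Serve.Pts.Won + Receiving.Points.Won +
                            Break.Point.Conversions,
                          data = train, importance = TRUE,
                          na.action = na.omit)
tennis_rf
```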
```
## 
## Call:
##  randomForest(formula = won ~ Winners + Unforced.Errors + RPW.TPW + W.minus.Aces + SPW.TPW + First.Serves.In + Total.Aces + Total.Double.Faults + First.Serve.Pts.Won + Receiving.Points.Won + Break.Point.Conversions, data = train, importance = TRUE, na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.081767
##                     % Var explained: 67.29
```

Test error

```
## [1] 0.090128
```

---
# Diagnostics: variable importance

```
##      %IncMSE IncNodePurity                variable
## 1  0.2256678       33.5117    Receiving.Points.Won
## 2  0.0290520        6.8291     First.Serve.Pts.Won
## 3  0.0204288        5.6186         Unforced.Errors
## 4  0.0184415        8.5577 Break.Point.Conversions
## 5  0.0165248        6.1234                 SPW.TPW
## 6  0.0164064        4.3985         First.Serves.In
## 7  0.0105363        3.5620                 Winners
## 8  0.0092988        3.6010              Total.Aces
## 9  0.0078857        3.9130                 RPW.TPW
## 10 0.0074110        2.6260            W.minus.Aces
## 11 0.0064382        3.1347     Total.Double.Faults
```

Because variables are sampled at each split, it is possible to assess the effect of each variable on the accuracy of the predictions. For winning a tennis match, `Receiving.Points.Won` is by far the most important variable: keep this high, and your chances of winning are much improved.

---
# Effect of number of trees

.pull-left[
- The training and test errors change as the number of trees averaged increases.
- Training error could go to zero. It is buffered here, because only the "out-of-bag" cases are predicted.
]
.pull-right[
<img src="week8.class1_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
]

- Test error will decline, up to a point, then plateau, or worse, get higher.
- Where the curves cross is the best choice for the number of trees, about 20 here.
- Balance complexity against accuracy (the bias-variance trade-off)

---
# Resources

- [Tutorial on Tree Based Modeling](https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/)

---
class: inverse middle

# Share and share alike

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.