class: center, middle, inverse, title-slide # Statistical Thinking using Randomisation and Simulation ## Boosted models ### Di Cook (
dicook@monash.edu
,
@visnut
) ### W8.C2 --- # Boosting - Why do you want to know about boosting? *The winners of almost all of the recent data competitions have used boosted models.* - The given definition is: "the method that converts weak learner to strong learners" - Actually, its better to think about it as *iteratively re-weighting cases to get a better fit for cases that are poorly predicted* --- # Relationship with weighted regression - Weighted regression is typically used when working with survey data, that to infer from the sample to the population we need to take the sample weights into account - Boosting fits a model, with equal weights. Then increases the weight for the cases with the biggest residuals, and re-fits. --- # Boosting steps - 1.Fit some regression model, where all observations have weight `\(1/n\)`, call this `\(m_1\)` with predictions `\(\hat{y}_1\)` - 2.Compute the residuals, use these to set weights - 3.Fit regression model again, with the weights, call this `\(m_j\)` with predictions `\(\hat{y}_j\)` - 4.Repeat steps 2, 3 many times, say `\(K\)` - 5.Final prediction is `\(\hat{y} = \sum_{i=1}^K \alpha_j \hat{y}_j\)`, where `\(\alpha_j\)` is linear coefficient, that sets the particular combination of the model predictions, could be `\(1/K\)`. Some really nice little animated gifs by Arthur Charpentier [here](https://www.r-bloggers.com/an-attempt-to-understand-boosting-algorithms/). Main ones are on the next few slides. --- Using a decision tree model. Blue is model from a single tee. Red is the boosted tree model. ![](boosting-algo-3.gif) --- # Let's try it <img src="week8.class2_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" /> --- # Generate a test set Boosting can fit training data too closely, so need a test set to protect against this. <img src="week8.class2_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> --- # Boosted tree - using xgboost ```r library(xgboost) mat <- as.matrix(df) df_xgb <- xgboost(data=mat, label=mat[,2], nrounds=50, max_depth = 1, eta=0.1, gamma=0.1) ``` <img src="week8.class2_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> Error is reported as Root Mean Square Error (RMSE). Training error gets smaller with each iteration. About 26 iterations both converge. --- # Fitted model <img src="week8.class2_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> --- # Comparison of models We will use RMSE, since this was the xgboost statistic. Boosted tree model: RMSE=0.073 Random forest model: RMSE=0.13 It doesn't make sense to fit a GLM to this data. --- # Fitted models <img src="week8.class2_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> --- # Tennis data Women's grand slam performance statistics for 2012. Response variable: `won` whether the player won the match or not Explanatory variables: `Winners`, `Unforced.Errors`, `First.Serves.In`, `Total.Aces`, `Total.Double.Faults`, `First.Serve.Pts.Won`, `Receiving.Points.Won`, `Break.Point.Conversions`, `RPW.TPW`, `W.minus.Aces`, `SPW.TPW` --- # Generalised linear model Why the binomial family?? ```r tennis_glm <- glm(won~Winners+Unforced.Errors+ RPW.TPW+W.minus.Aces+ SPW.TPW+First.Serves.In+Total.Aces+ Total.Double.Faults+First.Serve.Pts.Won+ Receiving.Points.Won+Break.Point.Conversions, data=train, family=binomial()) glm_pred <- predict(tennis_glm, test, type="response") ``` --- # Random forest ```r tennis_rf <- randomForest(won~Winners+Unforced.Errors+ RPW.TPW+W.minus.Aces+ SPW.TPW+First.Serves.In+Total.Aces+ Total.Double.Faults+First.Serve.Pts.Won+ Receiving.Points.Won+Break.Point.Conversions, data=train, na.action=na.omit, importance=TRUE) rf_pred <- predict(tennis_rf, test) #data.frame(rf_pred) %>% mutate(p=ifelse(X0>X1, X0, X1)) #rf_pred <- rf_pred$p ``` --- # Boosted tree ```r mat <- as.matrix(train[,c(12, 16, 17, 18, 19, 21, 24, 25, 32, 33, 34, 37)]) mat_test <- as.matrix(test[,c(12, 16, 17, 18, 19, 21, 24, 25, 32, 33, 34, 37)]) tennis_xgb <- xgboost(data=mat, label=mat[,12], nrounds=50, max_depth = 1, eta=0.1, gamma=0.1) xgb_pred <- predict(tennis_xgb, mat_test) ``` --- # Comparison GLM model: 0.183 Random forest model: 0.29 Boosted tree model: 0.016 --- # Model fits <img src="week8.class2_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> --- # Variable importance Generalised linear model ```r summary(tennis_glm)$coefficients ``` ``` ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -81.7899 18.6726 -4.380 1.19e-05 ## Winners 0.0778 0.1280 0.608 5.43e-01 ## Unforced.Errors -0.0167 0.0472 -0.354 7.23e-01 ## RPW.TPW -19.6733 8.3110 -2.367 1.79e-02 ## W.minus.Aces -0.1094 0.1348 -0.812 4.17e-01 ## SPW.TPW 65.2715 17.7823 3.671 2.42e-04 ## First.Serves.In -0.6364 0.1371 -4.643 3.43e-06 ## Total.Double.Faults -0.2739 0.1388 -1.973 4.85e-02 ## First.Serve.Pts.Won 0.8149 0.2152 3.787 1.53e-04 ## Receiving.Points.Won 1.5078 0.3194 4.721 2.35e-06 ## Break.Point.Conversions -0.0204 0.0295 -0.690 4.90e-01 ``` --- Random forest ```r tennis_rf$importance ``` ``` ## %IncMSE IncNodePurity ## Winners 0.01088 3.86 ## Unforced.Errors 0.01735 4.48 ## RPW.TPW 0.01168 3.80 ## W.minus.Aces 0.00603 2.85 ## SPW.TPW 0.01478 5.57 ## First.Serves.In 0.01634 4.38 ## Total.Aces 0.01512 4.66 ## Total.Double.Faults 0.00413 2.57 ## First.Serve.Pts.Won 0.03407 8.34 ## Receiving.Points.Won 0.21016 31.15 ## Break.Point.Conversions 0.02565 10.33 ``` --- # Resources - [Tutorial on Boosting ](http://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html) - [Hacker earth's tutorial](https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/) - [Analytics Vidhya introduction](https://www.analyticsvidhya.com/blog/2015/11/quick-introduction-boosting-algorithms-machine-learning/) --- class: inverse middle # Share and share alike <a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.