---
title: 'Statistical Thinking using Randomisation and Simulation'
subtitle: 'Boosted models'
author: "Di Cook (dicook@monash.edu, @visnut)"
date: "W8.C2"
output:
  xaringan::moon_reader:
    css: ["default", "myremark.css"]
    self_contained: false
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---

```{r setup, include = FALSE}
library(knitr)
opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  cache = FALSE,
  echo = FALSE,
  fig.height = 6,
  fig.width = 9,
  fig.align = 'center'
)
options(digits = 3)
library(tidyverse)
library(gridExtra)
library(broom)
```

# Boosting

- Why do you want to know about boosting? *The winners of almost all of the recent data competitions have used boosted models.*
- The definition usually given is: "the method that converts weak learners into strong learners"
- Actually, it's better to think of it as *iteratively re-weighting cases to get a better fit for the cases that are poorly predicted*

---
# Relationship with weighted regression

- Weighted regression is typically used with survey data, where the sample weights need to be taken into account to infer from the sample to the population
- Boosting first fits a model with equal weights, then increases the weights for the cases with the largest residuals and re-fits

---
# Boosting steps

1. Fit a regression model in which all observations have weight $1/n$; call this $m_1$, with predictions $\hat{y}_1$
2. Compute the residuals and use them to set new weights
3. Fit the regression model again with the new weights; call this $m_j$, with predictions $\hat{y}_j$
4. Repeat steps 2 and 3 many times, say $K$ times
5. The final prediction is $\hat{y} = \sum_{j=1}^K \alpha_j \hat{y}_j$, where $\alpha_j$ is a linear coefficient that sets the particular combination of the model predictions; it could simply be $1/K$

(A hand-rolled sketch of these steps appears after the xgboost example below.)

Arthur Charpentier has some really nice animated gifs [here](https://www.r-bloggers.com/an-attempt-to-understand-boosting-algorithms/). The main ones are on the next few slides.

---

Using a decision tree model. Blue is the model from a single tree. Red is the boosted tree model.

![](boosting-algo-3.gif)

---
# Let's try it

```{r}
set.seed(8000)
x <- sort(runif(100) - 0.5)
df <- data.frame(x, y = c(x[1:50]^2, x[51:75]*2, -x[76:100]^2) + rnorm(100)*0.1)
ggplot(df, aes(x=x, y=y)) + geom_point()
```

---
# Generate a test set

Boosting can fit the training data too closely, so we need a test set to protect against this.

```{r}
x <- sort(runif(100) - 0.5)
df_test <- data.frame(x, y = c(x[1:50]^2, x[51:75]*2, -x[76:100]^2) + rnorm(100)*0.1)
ggplot(df, aes(x=x, y=y)) + geom_point() +
  geom_point(data=df_test, aes(x=x, y=y), shape=2, colour="blue")
```

---
# Boosted tree - using xgboost

```{r echo=TRUE, results='hide', fig.width=5, fig.height=4}
library(xgboost)
# predictor matrix holds x only, so the response is not included as a feature
mat <- as.matrix(df[, "x", drop = FALSE])
df_xgb <- xgboost(data=mat, label=df$y, nrounds=50, max_depth = 1, eta=0.1, gamma=0.1)
```

```{r fig.width=5, fig.height=4}
d <- data.frame(df_xgb$evaluation_log)
d$test_rmse <- 1
mat_test <- as.matrix(df_test[, "x", drop = FALSE])
for (i in 1:nrow(d)) {
  d$test_rmse[i] <- sqrt(sum((df_test$y - predict(df_xgb, mat_test, ntreelimit=i))^2)/nrow(df_test))
}
d_m <- d %>% gather(type, rmse, -iter)
ggplot(d_m) + geom_line(aes(x=iter, y=rmse, colour=type)) +
  scale_colour_brewer(palette="Dark2")
```

Error is reported as Root Mean Square Error (RMSE). Training error gets smaller with each iteration; the test error shows when adding more trees stops helping. The fitted models that follow use 26 trees.
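---
# Boosting by hand

This is a minimal hand-rolled sketch of the re-weighting steps described earlier, applied to the same simulated data, with depth-1 `rpart` stumps as the weak learner and $\alpha_j = 1/K$. It is only meant to illustrate the idea; it is not the algorithm xgboost actually uses, and weighting by absolute residuals is just one simple choice.

```{r echo=TRUE, fig.width=5, fig.height=4}
library(rpart)
K <- 50
n <- nrow(df)
w <- rep(1, n)                      # step 1: equal weights
preds <- matrix(NA, n, K)
for (j in 1:K) {
  # fit a stump with the current case weights
  m_j <- rpart(y ~ x, data = df, weights = w,
               control = rpart.control(maxdepth = 1, cp = 0,
                                       minsplit = 2, minbucket = 1))
  preds[, j] <- predict(m_j, df)
  # step 2: residuals of the current fit set the new weights
  res <- abs(df$y - preds[, j])
  w <- n * res / sum(res)
}
# step 5: combine the K predictions with equal coefficients 1/K
df_hand <- data.frame(df, fit = rowMeans(preds))
ggplot(df_hand, aes(x = x, y = y)) + geom_point() +
  geom_line(aes(y = fit), colour = "hotpink")
```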
---
# Fitted model

```{r fig.height=5, fig.width=10}
pred <- predict(df_xgb, mat, ntreelimit=26)
df_aug <- cbind(df, pred)
p1 <- ggplot(df_aug, aes(x=x, y=y)) + geom_point() +
  ggtitle("training data") +
  geom_line(aes(y=pred), colour="hotpink", alpha=0.5)
pred <- predict(df_xgb, mat_test, ntreelimit=26)
df_test_aug <- cbind(df_test, pred)
p2 <- ggplot(df_test_aug, aes(x=x, y=y)) + geom_point() +
  ggtitle("test data") +
  geom_line(aes(y=pred), colour="hotpink", alpha=0.5)
grid.arrange(p1, p2, ncol=2)
```

---
# Comparison of models

We will use RMSE, since this is the statistic xgboost reports.

```{r}
RMSE <- function(obs, pred) {
  e <- obs - pred
  rmse <- sqrt(sum(e^2)/length(e))
  return(rmse)
}
```

```{r}
library(randomForest)
df_rf <- randomForest(y~x, data=df, ntree=50)
rf_pred <- predict(df_rf, df_test)
df_test_aug <- cbind(df_test_aug, rf_pred)
```

Boosted tree model: RMSE=`r RMSE(df_test$y, df_test_aug$pred)`

Random forest model: RMSE=`r RMSE(df_test$y, df_test_aug$rf_pred)`

It doesn't make sense to fit a GLM to this data.

---
# Fitted models

```{r fig.height=5, fig.width=10}
p1 <- ggplot(df_test_aug, aes(x=x, y=y)) + geom_point() +
  ggtitle("boosted tree") +
  geom_line(aes(y=pred), colour="hotpink", alpha=0.5)
p2 <- ggplot(df_test_aug, aes(x=x, y=y)) + geom_point() +
  ggtitle("random forest") +
  geom_line(aes(y=rf_pred), colour="hotpink", alpha=0.5)
grid.arrange(p1, p2, ncol=2)
```

---
# Tennis data

Women's grand slam performance statistics for 2012.

Response variable: `won`, whether the player won the match or not.

Explanatory variables: `Winners`, `Unforced.Errors`, `First.Serves.In`, `Total.Aces`, `Total.Double.Faults`, `First.Serve.Pts.Won`, `Receiving.Points.Won`, `Break.Point.Conversions`, `RPW.TPW`, `W.minus.Aces`, `SPW.TPW`

```{r}
library(caret)
tennis <- read_csv("../data/tennis_2012_wta_fixed.csv")
indx <- createDataPartition(tennis$won, p=0.67, list=FALSE)
train <- tennis[indx,]
test <- tennis[-indx,]
```

---
# Generalised linear model

Why the binomial family??
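Because `won` is a binary outcome, a binomial model for the probability of winning is the natural choice. A quick look at the response (a minimal check, assuming `won` is stored as a 0/1 indicator):

```{r echo=TRUE}
# the response takes only two values, hence family = binomial()
table(train$won)
```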
```{r echo=TRUE}
tennis_glm <- glm(won~Winners+Unforced.Errors+
                    RPW.TPW+W.minus.Aces+
                    SPW.TPW+First.Serves.In+Total.Aces+
                    Total.Double.Faults+First.Serve.Pts.Won+
                    Receiving.Points.Won+Break.Point.Conversions,
                  data=train, family=binomial())
glm_pred <- predict(tennis_glm, test, type="response")
```

---
# Random forest

```{r echo=TRUE}
tennis_rf <- randomForest(won~Winners+Unforced.Errors+
                            RPW.TPW+W.minus.Aces+
                            SPW.TPW+First.Serves.In+Total.Aces+
                            Total.Double.Faults+First.Serve.Pts.Won+
                            Receiving.Points.Won+Break.Point.Conversions,
                          data=train, na.action=na.omit, importance=TRUE)
rf_pred <- predict(tennis_rf, test)
#data.frame(rf_pred) %>% mutate(p=ifelse(X0>X1, X0, X1))
#rf_pred <- rf_pred$p
```

---
# Boosted tree

```{r results='hide', echo=TRUE}
mat <- as.matrix(train[,c(12, 16, 17, 18, 19, 21, 24, 25, 32, 33, 34, 37)])
mat_test <- as.matrix(test[,c(12, 16, 17, 18, 19, 21, 24, 25, 32, 33, 34, 37)])
# column 12 of this selection is used as the label, so drop it from the features
tennis_xgb <- xgboost(data=mat[,-12], label=mat[,12], nrounds=50,
                      max_depth = 1, eta=0.1, gamma=0.1)
xgb_pred <- predict(tennis_xgb, mat_test[,-12])
```

---
# Comparison

GLM model: `r RMSE(test$won, glm_pred)`

Random forest model: `r RMSE(test$won, rf_pred)`

Boosted tree model: `r RMSE(test$won, xgb_pred)`

---
# Model fits

```{r fig.height=4, fig.width=12}
test_aug <- cbind(test, glm_pred, rf_pred, xgb_pred)
p1 <- ggplot(test_aug, aes(x=glm_pred, y=won)) + geom_point()
p2 <- ggplot(test_aug, aes(x=rf_pred, y=won)) + geom_point()
p3 <- ggplot(test_aug, aes(x=xgb_pred, y=won)) + geom_jitter()
grid.arrange(p1, p2, p3, ncol=3)
```

---
# Variable importance

Generalised linear model

```{r echo=TRUE}
summary(tennis_glm)$coefficients
```

---

Random forest

```{r echo=TRUE}
tennis_rf$importance
```
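---

Boosted tree

One way to examine importance for the boosted tree is `xgb.importance()` from the xgboost package, which reports each feature's contribution to the gain (one option among several importance measures):

```{r echo=TRUE}
# gain-based importance of each feature in the boosted tree
xgb.importance(model = tennis_xgb)
```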
---
# Resources

- [Tutorial on Boosting](http://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html)
- [Hacker Earth's tutorial](https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/)
- [Analytics Vidhya introduction](https://www.analyticsvidhya.com/blog/2015/11/quick-introduction-boosting-algorithms-machine-learning/)

---
class: inverse middle
# Share and share alike

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.