--- title: 'Statistical Thinking using Randomisation and Simulation' subtitle: "Introduction and motivation" author: Di Cook (dicook@monash.edu, @visnut) date: "W1.C1" output: xaringan::moon_reader: css: ["default", "myremark.css"] self_contained: false nature: highlightStyle: github highlightLines: true countIncrementalSlides: false --- ```{r setup, include = FALSE} library(knitr) opts_chunk$set( message = FALSE, cache = FALSE, fig.height = 2, fig.width = 5, collapse = TRUE, comment = "#>" ) options(digits=2) library(dplyr) library(tidyr) ``` # Overview of the class - Topics - Assessment - Resources - Instructors, tutors --- # Topics - Topic 1: Simulation of games for decision strategies (2 weeks)) - Topic 2: Statistical distributions for decision theory (1.5 weeks) - Topic 3: Linear models for credibility theory (1.5 weeks) - Topic 4: Compiling data to problem solve (2 weeks) - Topic 5: Bayesian statistical thinking (1.5 weeks) - Topic 6: Temporal data and time series models (1.5 weeks) - Topic 7: Modeling risk and loss, with data and using randomization to assess uncertainty (2 weeks) --- # Assessment - Final exam: 60% - Tutorials/labs: 30%, Weekly reports due Monday noon after the lab - Quizzes: 10% - ETC5242 students: Labs 15%, Project report and presentation 15% --- # Resources - Web site: [https://st.netlify.com](https://st.netlify.com) - Moodle - [Statistics online textbook](https://www.openintro.org/stat/textbook.php?stat_book=isrs ) - [Accuarial online curriculum/exam material](https://www.actuaries.org.uk/studying/plan-my-study-route/fellowshipassociateship/core-technical-subjects/ct6-statistical-methods) - Software: [R](https://cran.r-project.org), [RStudio Desktop](https://www.rstudio.com/products/rstudio/download2/) --- # Instructors - Instructors: - Professor Di Cook, Menzies 762A - Tutors: - Stuart Lee (working with Di on PhD) - Dilini Talagala (working with Rob Hyndman on PhD) - Thiyanga Talagala (working with Rob Hyndman on PhD) - Nathaniel Tomasetti (worked with Di for Honors, working with Dr Catherine Forbes on PhD) - Earo Wang (working with Di on PhD) --- # What is randomness? - Coin flip - Die roll - Your sporting team wins - Gender of a baby - Rain tomorrow - Stock price in an hour from now - Lightning strike - Pipe burst --- class: inverse middle # Your turn We are going to play a game of "Stump the Professor". Flip a coin. If it shows up tails do A first, if it shows up heads to B first. `A. Write down a sequence of heads and tails that you might expect to come from TWENTY flips of a coin` `B. Now flip a coin TWENTY times, and write down the outcomes` - Enter these in the [online sheet](https://docs.google.com/forms/d/155fP-mdd0HevqNYEVUngEBVWHXmxYi-B5zPzKjikEb0/edit) (Remember whether you entered the coin flip sequence first or the made up sequence.) - Now I am going to look at what you entered, and guess if sequence was made up, or actual outcomes from coin flips. - You record how many times I get it right. --- # Example: a look at the Australian electoral distribution - Results of 2013 election from Australian Electoral Commission web site - 2011 Census data from the Australian Bureau of Statistics - Combined demographics of electorate with political representation - Interactive application, in R package `eechidna` --- # How to use randomization to understand probability ```{r echo=FALSE, fig.width=10, fig.height=4} library(eechidna) library(dplyr) library(ggplot2) aec2013 <- aec2013_2cp_electorate %>% filter(Elected == "Y") aec_abs <- merge(aec2013, abs2011, by = "Electorate") aec_abs$PartyGp <- aec_abs$PartyAb aec_abs$PartyGp[aec_abs$PartyGp %in% c("LP","LNP","NP","CLP")] <- "Coalition" aec_abs$PartyGp[aec_abs$PartyGp %in% c("IND","PUP","KAP","GRN")] <- "Other" ggplot(data=aec_abs, aes(x=Population)) + geom_dotplot(binwidth=2900) + facet_wrap(~PartyGp, ncol = 3) + ylab("") + xlab("Population ('000)") + scale_x_continuous(breaks=seq(75000, 225000, 25000), labels=seq(75, 225, 25)) ``` --- class: inverse middle # Your turn - What is the difference (roughly) in population between the biggest and smallest electorates? - What is the relative worth of a voter in the electorate with the largest population, compared to a voter in the electorate with the smallest population? --- # Politics - Ideally all electorates have exactly the same number of people. - Geography can interfere with this, e.g an electorate cannot be part in Tasmania and part in Victoria. - The Australian Electoral Commission will adjust geographic boundaries before each election to adjust for population changes as measured in the most recent Census. --- # Compute averages ```{r echo=FALSE} aec_abs_means <- aec_abs %>% filter(PartyGp != "Other") %>% group_by(PartyGp) %>% summarise(m = mean(Population, na.rm=T), s = sd(Population, na.rm=T)) aec_abs_means ``` --- # Statistical thinking - The means are different - How big is this difference? - How likely is this difference to have arisen by chance? We could use a two-sample t-test to answer these, but here is how to do the equivalent by randomization. --- # Procedure - Compute the statistic for the data (e.g. absolute value of mean difference) - Shuffle the group labels (e.g. put the MP party names into a hat, mix them around, draw them and assign to new electorate) - Compute the statistic for this shuffled data - Repeat steps 2, 3 many times - Examine how often the value of the data statistic, or a larger value occurs --- # Let's do it ```{r echo=FALSE, fig.height=4, fig.width=8, fig.align='center'} library(purrr) mad <- function(df, shuffle=TRUE) { if (shuffle) df$PartyGp <- sample(df$PartyGp) df_means <- df %>% group_by(PartyGp) %>% summarise(m = mean(Population, na.rm=T)) return(d = abs(df_means$m[1] - df_means$m[2])) } aec_abs_sub <- aec_abs %>% filter(PartyGp != "Other") aec_abs_meandif <- mad(aec_abs_sub, shuffle=FALSE) aec_abs_shuffle <-1:1000 %>% map_dbl(~ mad(aec_abs_sub)) aec_abs_shuffle <- data.frame(d=aec_abs_shuffle, y=1:1000) ggplot(data=aec_abs_shuffle, aes(x=d)) + geom_dotplot(binwidth=100) + geom_vline(xintercept=aec_abs_meandif, colour="red") ``` Let's also count the number of times that we see a bigger difference by chance. It is `r length(aec_abs_shuffle$d[aec_abs_shuffle$d > aec_abs_meandif])`. --- # What does this mean? If we oberve a difference this large `r length(aec_abs_shuffle$d[aec_abs_shuffle$d > aec_abs_meandif])` out of `r length(aec_abs_shuffle$d)` random shuffles, is it likely to see this electorate distribution by chance? --- # Caveats Let's wait until the next Census results are in (after August this year) and the latest election results, to compare populations of electorates again. --- class: inverse middle # Share and share alike Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.