---
title: 'Statistical Thinking using Randomisation and Simulation'
subtitle: "Introduction and motivation"
author: Di Cook (dicook@monash.edu, @visnut)
date: "W1.C1"
output:
  xaringan::moon_reader:
    css: ["default", "myremark.css"]
    self_contained: false
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
```{r setup, include = FALSE}
library(knitr)
opts_chunk$set(
message = FALSE,
cache = FALSE,
fig.height = 2,
fig.width = 5,
collapse = TRUE,
comment = "#>"
)
options(digits=2)
library(dplyr)
library(tidyr)
```
# Overview of the class
- Topics
- Assessment
- Resources
- Instructors, tutors
---
# Topics
- Topic 1: Simulation of games for decision strategies (2 weeks)
- Topic 2: Statistical distributions for decision theory (1.5 weeks)
- Topic 3: Linear models for credibility theory (1.5 weeks)
- Topic 4: Compiling data to problem solve (2 weeks)
- Topic 5: Bayesian statistical thinking (1.5 weeks)
- Topic 6: Temporal data and time series models (1.5 weeks)
- Topic 7: Modeling risk and loss, with data and using randomization to assess uncertainty (2 weeks)
---
# Assessment
- Final exam: 60%
- Tutorials/labs: 30%; weekly reports are due at noon on the Monday after the lab
- Quizzes: 10%
- ETC5242 students: Labs 15%, Project report and presentation 15%
---
# Resources
- Web site: [https://st.netlify.com](https://st.netlify.com)
- Moodle
- [Statistics online textbook](https://www.openintro.org/stat/textbook.php?stat_book=isrs)
- [Actuarial online curriculum/exam material](https://www.actuaries.org.uk/studying/plan-my-study-route/fellowshipassociateship/core-technical-subjects/ct6-statistical-methods)
- Software: [R](https://cran.r-project.org), [RStudio Desktop](https://www.rstudio.com/products/rstudio/download2/)
---
# Instructors
- Instructors:
  - Professor Di Cook, Menzies 762A
- Tutors:
  - Stuart Lee (working with Di on PhD)
  - Dilini Talagala (working with Rob Hyndman on PhD)
  - Thiyanga Talagala (working with Rob Hyndman on PhD)
  - Nathaniel Tomasetti (worked with Di for Honors, working with Dr Catherine Forbes on PhD)
  - Earo Wang (working with Di on PhD)
---
# What is randomness?
- Coin flip
- Die roll
- Your sporting team wins
- Gender of a baby
- Rain tomorrow
- Stock price in an hour from now
- Lightning strike
- Pipe burst
---
class: inverse middle
# Your turn
We are going to play a game of "Stump the Professor". Flip a coin. If it shows tails, do A first; if it shows heads, do B first.
`A. Write down a sequence of heads and tails that you might expect to come from TWENTY flips of a coin`
`B. Now flip a coin TWENTY times, and write down the outcomes`
- Enter these in the [online sheet](https://docs.google.com/forms/d/155fP-mdd0HevqNYEVUngEBVWHXmxYi-B5zPzKjikEb0/edit) (Remember whether you entered the coin flip sequence first or the made up sequence.)
- Now I am going to look at what you entered, and guess whether each sequence was made up or came from actual coin flips.
- You record how many times I get it right.
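
Step B can also be mimicked on a computer. The sketch below is an illustration only: it simulates twenty fair flips with `sample()` and uses `rle()` to find the longest run of identical outcomes. Looking at run lengths is an assumption about one feature that might separate real sequences from invented ones; the slides do not say how the guesses are made. The seed and the names `flips`, `runs` and `longest_run` are hypothetical.

```r
# Simulate twenty flips of a fair coin (seed chosen arbitrarily for reproducibility)
set.seed(1)
flips <- sample(c("H", "T"), size = 20, replace = TRUE)

# rle() collapses the sequence into runs of identical outcomes
runs <- rle(flips)
longest_run <- max(runs$lengths)

# Show the sequence as one string, then the length of its longest run
paste(flips, collapse = "")
longest_run
```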
---
# Example: a look at the Australian electoral distribution
- Results of 2013 election from Australian Electoral Commission web site
- 2011 Census data from the Australian Bureau of Statistics
- Combined demographics of electorate with political representation
- Interactive application, in R package `eechidna`
---
# How to use randomization to understand probability
```{r echo=FALSE, fig.width=10, fig.height=4}
library(eechidna)
library(dplyr)
library(ggplot2)
aec2013 <- aec2013_2cp_electorate %>%
filter(Elected == "Y")
aec_abs <- merge(aec2013, abs2011, by = "Electorate")
aec_abs$PartyGp <- aec_abs$PartyAb
aec_abs$PartyGp[aec_abs$PartyGp %in% c("LP","LNP","NP","CLP")] <- "Coalition"
aec_abs$PartyGp[aec_abs$PartyGp %in% c("IND","PUP","KAP","GRN")] <- "Other"
ggplot(data=aec_abs, aes(x=Population)) + geom_dotplot(binwidth=2900) +
facet_wrap(~PartyGp, ncol = 3) + ylab("") + xlab("Population ('000)") +
scale_x_continuous(breaks=seq(75000, 225000, 25000), labels=seq(75, 225, 25))
```
---
class: inverse middle
# Your turn
- What is the difference (roughly) in population between the biggest and smallest electorates?
- What is the relative worth of a voter in the electorate with the largest population, compared to a voter in the electorate with the smallest population?
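
The second question comes down to a ratio. As a rough sketch, assume each electorate elects one member, so one person's share of a seat is one over the electorate's population. The two populations below are made-up round numbers for illustration, not values read from the data, and the variable names are hypothetical.

```r
# Hypothetical populations, for illustration only (not taken from the data)
pop_largest  <- 200000
pop_smallest <- 100000

# Assumption: one member per electorate, so one person's share of a seat is 1 / population
worth_largest  <- 1 / pop_largest
worth_smallest <- 1 / pop_smallest

# Worth of a voter in the largest electorate relative to one in the smallest
relative_worth <- worth_largest / worth_smallest
relative_worth
```

With these invented numbers, a voter in the largest electorate carries half the weight of a voter in the smallest.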
---
# Politics
- Ideally all electorates have exactly the same number of people.
- Geography can interfere with this, e.g. an electorate cannot be partly in Tasmania and partly in Victoria.
- The Australian Electoral Commission redraws geographic boundaries before each election to account for population changes, as measured in the most recent Census.
---
# Compute averages
```{r echo=FALSE}
# Mean and standard deviation of electorate population for the two major party groups
aec_abs_means <- aec_abs %>%
  filter(PartyGp != "Other") %>%
  group_by(PartyGp) %>%
  summarise(m = mean(Population, na.rm = TRUE), s = sd(Population, na.rm = TRUE))
aec_abs_means
```
---
# Statistical thinking
- The means are different
- How big is this difference?
- How likely is this difference to have arisen by chance?

We could use a two-sample t-test to answer these, but here is how to do the equivalent by randomization.
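
For comparison, the two-sample t-test mentioned above could be run with `t.test()`. This is a minimal sketch on simulated stand-in data: the group labels `A` and `B`, the group sizes, the means and the standard deviation are all invented, and `sim` and `tt` are hypothetical names. It is not a result for the electorate data.

```r
# Simulated stand-in data: two groups with invented sizes, means and spread
set.seed(1)
sim <- data.frame(
  PartyGp = rep(c("A", "B"), times = c(55, 90)),
  Population = c(rnorm(55, mean = 145000, sd = 15000),
                 rnorm(90, mean = 140000, sd = 15000))
)

# Welch two-sample t-test comparing mean Population between the two groups
tt <- t.test(Population ~ PartyGp, data = sim)
tt$statistic  # t statistic
tt$p.value    # two-sided p-value
```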
---
# Procedure
1. Compute the statistic for the data (e.g. absolute value of the mean difference)
2. Shuffle the group labels (e.g. put the MP party names into a hat, mix them around, draw them out and assign each to a new electorate)
3. Compute the statistic for this shuffled data
4. Repeat steps 2 and 3 many times
5. Examine how often a value as large as, or larger than, the data statistic occurs
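
The steps above can be sketched in base R. This is an illustration under stated assumptions, not the code behind the slides: the data are simulated with invented group means, and `sim`, `mean_diff`, `observed` and `shuffled` are hypothetical names.

```r
# Simulated stand-in data (invented values, not the electorate data)
set.seed(1)
sim <- data.frame(
  group = rep(c("A", "B"), each = 30),
  value = c(rnorm(30, mean = 10), rnorm(30, mean = 11))
)

# Statistic: absolute difference between the two group means
mean_diff <- function(value, group) {
  m <- tapply(value, group, mean)
  abs(m[[1]] - m[[2]])
}

# Step 1: statistic for the observed data
observed <- mean_diff(sim$value, sim$group)

# Steps 2 to 4: shuffle the labels, recompute the statistic, repeat 1000 times
shuffled <- replicate(1000, mean_diff(sim$value, sample(sim$group)))

# Step 5: proportion of shuffles with a statistic at least as large as the observed one
mean(shuffled >= observed)
```

A small proportion suggests that a difference this large rarely arises from random labelling alone.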
---
# Let's do it
```{r echo=FALSE, fig.height=4, fig.width=8, fig.align='center'}
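library(purrr)
# Absolute difference in mean Population between the two party groups;
# with shuffle = TRUE the group labels are randomly permuted first.
# (Note: this name masks stats::mad for the rest of the session.)
mad <- function(df, shuffle = TRUE) {
  if (shuffle) {
    df$PartyGp <- sample(df$PartyGp)
  }
  df_means <- df %>%
    group_by(PartyGp) %>%
    summarise(m = mean(Population, na.rm = TRUE))
  abs(df_means$m[1] - df_means$m[2])
}
aec_abs_sub <- aec_abs %>% filter(PartyGp != "Other")
# Statistic for the observed data
aec_abs_meandif <- mad(aec_abs_sub, shuffle = FALSE)
# Statistic for each of 1000 random shuffles of the labels
aec_abs_shuffle <- 1:1000 %>% map_dbl(~ mad(aec_abs_sub))
aec_abs_shuffle <- data.frame(d = aec_abs_shuffle, y = 1:1000)
# Dotplot of the shuffled statistics, with the observed value marked in red
ggplot(data = aec_abs_shuffle, aes(x = d)) +
  geom_dotplot(binwidth = 100) +
  geom_vline(xintercept = aec_abs_meandif, colour = "red")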
```
Let's also count the number of times that we see a bigger difference by chance. It is `r length(aec_abs_shuffle$d[aec_abs_shuffle$d > aec_abs_meandif])`.
---
# What does this mean?
If we observe a difference this large in `r length(aec_abs_shuffle$d[aec_abs_shuffle$d > aec_abs_meandif])` out of `r length(aec_abs_shuffle$d)` random shuffles, how likely is it that this electorate distribution arose by chance?
---
# Caveats
Let's wait until the next Census results (due after August this year) and the latest election results are in, and then compare the populations of electorates again.
---
class: inverse middle
# Share and share alike
This work is licensed under a Creative Commons Attribution 4.0 International License.