Statistical Thinking using Randomisation and Simulation

class: center, middle, inverse, title-slide

# Statistical Thinking using Randomisation and Simulation
## Statistical distributions
### Di Cook (<a href="mailto:dicook@monash.edu">dicook@monash.edu</a>, <span class="citation">@visnut</span>)
### W3.C1

---

# Overview of this class

- Random numbers
- Mapping random numbers to events for simulation
- Statistical distributions
- Density functions

---
# Random numbers

- True random number generators: [Radioactive decay](https://www.fourmilab.ch/hotbits/), [electromagnetic field of a vacuum](https://qrng.anu.edu.au)
- Computers, technically, provide only pseudo-random numbers, generated by a deterministic process, e.g. a linear congruential generator with large `$a, b, m$`

`$$X_{n+1} = (aX_n + b) \bmod m$$`
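
A minimal sketch of a linear congruential generator in R. The function name `lcg` and its default constants are assumptions for illustration (the constants are the ones commonly attributed to Numerical Recipes), not values taken from this slide.

```r
# Linear congruential generator: X_{n+1} = (a * X_n + b) mod m.
# lcg() and its default constants are illustrative assumptions.
lcg <- function(n, seed, a = 1664525, b = 1013904223, m = 2^32) {
  x <- numeric(n)
  state <- seed
  for (i in seq_len(n)) {
    # a * state + b stays below 2^53, so double arithmetic is exact here
    state <- (a * state + b) %% m
    x[i] <- state
  }
  x / m  # rescale the integers to [0, 1)
}

u <- lcg(5, seed = 1)
u                               # five pseudo-random values in [0, 1)
identical(u, lcg(5, seed = 1))  # same seed, same sequence: deterministic
```

The same seed always replays the same sequence, which is why these numbers are called pseudo-random.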

---
# RANDU - a bad PRNG

- Used from the 1960s onwards

`$$X_{n+1} = 65539 X_n \bmod 2^{31}$$`
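
One way to see why RANDU is bad: the multiplier is `$65539 = 2^{16} + 3$`, so `$65539^2 \equiv 6 \cdot 65539 - 9 \pmod{2^{31}}$`, and every value is a fixed linear combination of the two before it. A short check in R, where the helper name `randu` and the seed are assumptions:

```r
# RANDU: X_{n+1} = 65539 * X_n mod 2^31 (the seed should be odd)
randu <- function(n, seed = 1) {
  x <- numeric(n)
  state <- seed
  for (i in seq_len(n)) {
    state <- (65539 * state) %% 2^31  # product is below 2^53, so exact
    x[i] <- state
  }
  x
}

x <- randu(1000)
k <- 1:998
# x[k + 2] - 6 * x[k + 1] + 9 * x[k] is always a multiple of 2^31
ok <- all((x[k + 2] - 6 * x[k + 1] + 9 * x[k]) %% 2^31 == 0)
ok
#> [1] TRUE
```

Because of this relation, consecutive triples of RANDU output fall on a small number of planes in three dimensions.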

---
# Mersenne Twister

- algorithm is a twisted generalised feedback shift register (TGFSR)
- based on a Mersenne prime, `$2^m-1$`
- most commonly used today
- each integer will occur the same number of times in a period
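
R uses the Mersenne Twister as its default generator. The sketch below sets it explicitly and shows that resetting the seed replays the same stream; the seed value is arbitrary.

```r
RNGkind("Mersenne-Twister")  # R's default generator, set explicitly here
RNGkind()[1]                 # name of the generator in use

set.seed(2420)               # arbitrary seed
a <- runif(3)
set.seed(2420)
b <- runif(3)
identical(a, b)              # same seed, same pseudo-random numbers
```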

---
# Using random numbers to estimate things

- Suppose we want to estimate the proportion of the class enrolled in 2420 (rather than 5242)
- We actually know that 152 students are enrolled in 2420 and 36 in 5242. BUT SUPPOSE WE DON'T KNOW THIS!
- We see a random sample of 20 students, and have to guess the proportion for the whole class.

---
# Using random numbers

- Random number tables (old fashioned) deliver single digits 0, 1, ..., 9
- When using these you need to ensure that you map these digits or combinations of the digits to match the probabilities of events
- For example, use random numbers to sample students from class
    + There are 188 students in the class
    + Each student, or possible group of students, needs to have an equal chance of being selected
    + Need to use three sequential digits
    + BUT there are 1000 three-digit numbers (000 to 999), so either we throw away 812 of them, or we map each person to multiple numbers (5 each) and throw away only 60
    + If any person is selected more than once, throw out repeats
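
The two mapping schemes above can be sketched in R as follows. The variable names, the seed, and the choice of assigning code `k` to student `k mod 188` are assumptions; any rule that gives every student the same number of codes would do.

```r
n_students <- 188
n_codes <- 1000                        # three-digit codes 000..999

# Option 1: one code per student, discard all other codes
n_codes - n_students                   # number of codes thrown away

# Option 2: give each student as many codes as fit evenly
per_student <- n_codes %/% n_students  # codes per student
n_codes - per_student * n_students     # number of codes thrown away

# Option 2 mapping: a usable code k belongs to student k mod 188
set.seed(1)                            # arbitrary seed
codes <- sample(0:999, 200, replace = TRUE)
usable <- codes[codes < per_student * n_students]
students <- unique(usable %% n_students)  # throw out repeated students
chosen <- head(students, 20)              # first 20 distinct students
chosen
```

Option 2 wastes far fewer random digits, while each student is still equally likely to be picked.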

---
# Assign every class member a number

```
#> # A tibble: 188 x 4
#>        first     last section number
#>        <chr>    <chr>   <chr>  <chr>
#>  1     Ahmed    Nehal    2420    000
#>  2       Ang      Xin    2420    001
#>  3   Bradley    Tyler    2420    002
#>  4       Bui    Hoang    2420    003
#>  5       Bui   Nathan    2420    004
#>  6   Bundhoo Mokshada    2420    005
#>  7   Bundhoo  Urvashi    2420    006
#>  8 Castricum     Liam    2420    007
#>  9  Cavanagh    James    2420    008
#> 10      Chan   Ernest    2420    009
#> # ... with 178 more rows
```

---
# Generate random digits

```
#>    [1] 0 8 0 9 0 8 9 6 5 5 3 0 8 6 3 1 9 9 8 9 6 7 0 2 2 7 6 4 0 2 0 8 4 3
#>   [35] 7 9 4 9 1 6 6 7 7 8 8 9 5 6 6 5 5 7 2 6 1 7 1 2 6 4 4 4 9 6 5 8 9 2
#>   [69] 7 2 7 2 3 4 2 3 7 3 6 8 0 9 3 4 9 9 1 3 5 9 9 8 6 1 5 7 5 9 4 8 9 8
#>  [103] 4 3 5 2 2 6 9 7 3 6 7 9 3 0 1 5 7 2 4 7 5 2 5 5 4 3 1 1 5 1 3 4 9 7
#>  [137] 8 2 7 3 7 8 8 2 4 8 1 4 4 5 8 1 2 3 5 4 8 8 1 1 7 6 5 2 8 2 3 0 9 6
#>  [171] 4 6 6 5 2 5 8 3 4 2 4 9 5 2 5 8 4 2 1 4 0 1 8 5 5 3 3 6 0 6 6 1 6 3
#>  [205] 7 4 9 9 3 7 7 4 9 6 6 9 9 2 5 5 9 9 5 4 1 7 8 9 0 4 3 3 8 5 5 6 4 6
#>  [239] 0 8 2 4 0 5 2 1 1 9 4 4 2 5 8 3 5 8 9 2 9 1 1 8 6 4 8 3 6 7 3 1 1 5
#>  [273] 8 9 1 5 9 3 0 1 9 8 7 5 0 1 6 5 9 2 6 2 5 9 3 5 8 5 3 8 7 6 1 9 4 0
#>  [307] 5 7 9 9 3 5 5 3 2 7 1 9 8 9 2 1 9 1 6 0 7 7 5 9 9 4 3 7 7 3 8 9 0 2
#>  [341] 6 6 6 0 3 1 9 7 6 4 3 7 9 1 6 5 2 0 2 7 7 8 8 0 0 6 5 2 7 0 6 7 4 8
#>  [375] 7 7 4 1 7 3 4 2 7 7 1 0 0 0 3 8 1 8 8 2 8 5 5 0 0 7 2 9 1 2 9 8 4 7
#>  [409] 1 2 4 8 3 4 6 1 9 4 4 0 0 3 0 1 0 1 1 2 9 5 0 6 9 4 7 2 0 3 6 1 4 5
#>  [443] 1 0 4 7 7 2 8 9 5 3 9 9 8 4 8 5 3 7 6 6 5 2 0 1 0 2 2 3 1 0 4 1 3 0
#>  [477] 1 7 7 2 1 2 4 0 8 6 8 0 9 8 2 3 8 9 5 1 3 2 8 6 1 5 9 5 2 7 2 8 7 8
#>  [511] 9 7 8 3 5 9 3 3 2 4 5 6 0 0 4 0 5 9 3 5 3 5 3 6 5 9 7 6 3 3 9 0 0 3
#>  [545] 4 6 2 3 3 4 8 5 8 4 7 0 2 8 9 9 1 5 5 6 2 1 0 6 9 4 2 0 3 3 9 8 2 2
#>  [579] 5 9 3 9 3 9 9 0 3 6 3 7 6 1 3 2 2 8 3 0 1 6 7 2 1 2 2 2 7 7 7 2 9 0
#>  [613] 7 8 0 6 9 4 3 9 8 6 7 4 1 0 8 4 7 9 2 1 1 2 6 3 5 0 3 6 0 8 0 4 5 3
#>  [647] 1 9 1 6 5 7 2 1 5 4 8 0 1 3 8 9 1 7 5 8 5 7 5 1 2 1 5 7 1 5 7 5 9 0
#>  [681] 0 0 3 8 3 0 3 9 8 0 0 9 7 8 2 4 1 1 4 1 4 6 7 6 4 8 0 9 5 7 3 4 0 6
#>  [715] 1 9 7 9 3 8 2 1 0 5 9 8 5 9 9 0 0 6 1 8 5 9 9 8 2 3 1 8 0 5 7 3 4 2
#>  [749] 9 9 9 0 3 0 2 8 3 8 8 7 7 6 6 2 4 1 7 9 9 6 2 8 5 3 1 0 3 8 7 6 8 2
#>  [783] 7 4 1 9 5 7 0 4 1 1 3 1 7 2 1 0 1 9 9 2 4 2 7 3 4 2 8 2 7 5 9 4 1 6
#>  [817] 3 7 0 4 1 2 1 5 2 1 0 5 1 9 9 2 3 1 4 2 6 0 2 9 2 5 0 4 0 8 2 4 1 6
#>  [851] 5 2 8 1 0 4 2 7 6 0 4 7 1 7 9 6 2 0 8 4 5 5 4 9 3 6 6 4 1 8 9 9 3 9
#>  [885] 4 0 6 6 3 5 7 7 2 3 1 2 8 0 6 3 2 5 9 9 8 3 7 6 9 0 4 2 1 5 4 7 9 5
#>  [919] 0 5 7 6 0 7 6 1 0 0 3 6 9 5 2 8 4 9 4 2 1 6 7 5 8 2 3 6 8 3 4 1 4 2
#>  [953] 2 7 7 3 4 2 6 3 3 2 2 9 9 2 0 4 7 4 6 2 4 6 5 2 0 8 1 5 7 2 2 3 8 6
#>  [987] 1 3 1 8 1 0 2 6 8 9 1 8 0 0 5 2
```

---
# Group in threes

```
#>       [,1] [,2] [,3]
#>  [1,]    0    8    0
#>  [2,]    9    0    8
#>  [3,]    9    6    5
#>  [4,]    5    3    0
#>  [5,]    8    6    3
#>  [6,]    1    9    9
#>  [7,]    8    9    6
#>  [8,]    7    0    2
#>  [9,]    2    7    6
#> [10,]    4    0    2
#> [11,]    0    8    4
#> [12,]    3    7    9
#> [13,]    4    9    1
#> [14,]    6    6    7
#> [15,]    7    8    8
#> [16,]    9    5    6
#> [17,]    6    5    5
#> [18,]    7    2    6
#> [19,]    1    7    1
#> [20,]    2    6    4
```

---
# Throw away numbers > 187

```
#>  [1]  80  84 171 157 151 176  66 163  43 158 165  27   6  67 100  38 129
#> [18] 124  30 101
```
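
The grouping and filtering steps can be sketched in base R. The digits below are the first 60 from the "Generate random digits" slide (the 20 triples shown under "Group in threes"); the variable names are assumptions.

```r
digits <- c(0,8,0, 9,0,8, 9,6,5, 5,3,0, 8,6,3, 1,9,9, 8,9,6, 7,0,2, 2,7,6, 4,0,2,
            0,8,4, 3,7,9, 4,9,1, 6,6,7, 7,8,8, 9,5,6, 6,5,5, 7,2,6, 1,7,1, 2,6,4)

triples <- matrix(digits, ncol = 3, byrow = TRUE)  # one row per three digits
ids <- as.vector(triples %*% c(100, 10, 1))        # e.g. 0 8 0 becomes 80
keep <- unique(ids[ids <= 187])                    # valid numbers, no repeats
keep
#> [1]  80  84 171
```

Only 3 of these 20 triples are usable, which is why many more digits are needed to reach a sample of 20 students.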

---
# Find class members

```
#> # A tibble: 20 x 4
#>        first       last section number
#>        <chr>      <chr>   <chr>  <chr>
#>  1  Moulding       Ryan    2420    080
#>  2    Nguyen      Jason    2420    084
#>  3       Soo    Matthew    5242    171
#>  4   Freitas      Filho    5242    157
#>  5      Zhou    Jinghao    2420    151
#>  6        Vu        Thi    5242    176
#>  7       Liu    Yucheng    2420    066
#>  8       Lau    Vincent    5242    163
#>  9    Ingram    Timothy    2420    043
#> 10  Gunasena      Geema    5242    158
#> 11       Lim       Zhee    5242    165
#> 12       Gee   Harrison    2420    027
#> 13   Bundhoo    Urvashi    2420    006
#> 14       Liu     Zhaoqi    2420    067
#> 15    Sandhu   Jaskirat    2420    100
#> 16      Hewa Atapattuge    2420    038
#> 17     Vuong       Jone    2420    129
#> 18      Tong  Zhengqing    2420    124
#> 19    Grewal      Sahil    2420    030
#> 20 Schmierer      Corey    2420    101
```

---
# Compute proportion

Estimated proportion is:

```
#> [1] 0.7
```

True proportion is 152/188 = 0.81.
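
A sketch of the calculation, using the section labels of the 20 sampled students from the "Find class members" slide (the vector name is an assumption):

```r
# Sections of the 20 sampled students, in the order shown on the slide
section <- c("2420", "2420", "5242", "5242", "2420", "5242", "2420", "5242",
             "2420", "5242", "5242", rep("2420", 9))

mean(section == "2420")  # sample proportion enrolled in 2420
#> [1] 0.7
152 / 188                # class proportion from the enrolment counts
```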

---
# Simpler approach

```r
library(dplyr)
# Draw 20 students at random, without replacement, in one step
class_all %>% sample_n(20)
#> # A tibble: 20 x 3
#>         first      last section
#>         <chr>     <chr>   <chr>
#>  1         La       Gia    2420
#>  2         Lu   Junrong    2420
#>  3        Liu   Yucheng    2420
#>  4         Wu  Xiaoxiao    5242
#>  5 Prathivadi    Pranay    2420
#>  6        Kim     Yejin    2420
#>  7   Tjoaquin   Calista    2420
#>  8       Tran      Minh    5242
#>  9        Mao     Haoyu    2420
#> 10     Soares    Stefan    2420
#> 11        Lai  Benjamin    2420
#> 12      Zhang       Hui    2420
#> 13   Varghese    Adarsh    2420
#> 14       Miao    Yupeng    2420
#> 15      Scott    Ridley    2420
#> 16      Zheng Jianxiang    2420
#> 17       Yang     Huiyi    2420
#> 18   Soenarto Cristofer    2420
#> 19        Lao     Tommy    2420
#> 20    Jackson  Danielle    2420
```
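
The same idea in base R, as a self-contained sketch. `class_demo` is a hypothetical stand-in for the class list, built only from the enrolment counts (152 in 2420, 36 in 5242); the seed is arbitrary.

```r
# Hypothetical stand-in for the class list: numbers 000..187 plus a section
class_demo <- data.frame(
  number  = 0:187,
  section = rep(c("2420", "5242"), times = c(152, 36))
)

set.seed(2420)                        # arbitrary seed
rows <- sample(nrow(class_demo), 20)  # 20 distinct row indices
sampled <- class_demo[rows, ]
mean(sampled$section == "2420")       # estimated proportion in 2420
```

`sample()` draws without replacement by default, so no student can be chosen twice.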

---
# Statistical distributions

- Uniform
- Normal 
- Exponential
- Poisson
- Binomial
- Pareto
- Weibull
- Gamma
- Lognormal

---
# Random numbers = Uniform

- symmetric, flat: every value is equally likely
- e.g. `$U\{0, ..., 9\}$`
- e.g. `$P(X=x) = f(x) = 1/10, ~~ x \in \{0, ..., 9\}$`
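
A quick simulation check, assuming digits are drawn with `sample()`; the seed and sample size are arbitrary. Each digit should turn up in roughly one tenth of the draws.

```r
set.seed(1)                                   # arbitrary seed
digits <- sample(0:9, 10000, replace = TRUE)  # discrete uniform on 0..9
freq <- prop.table(table(digits))             # observed relative frequencies
round(freq, 3)                                # each should be close to 0.1
```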

---
# Normal distribution

- Gaussian, bell-shaped
- symmetric, unimodal
- `$N(\mu, \sigma)$`

`$$f(x~|~\mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}, ~~~ -\infty<x<\infty$$`
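
The density formula can be checked against R's `dnorm()`, which is also parameterised by the mean and the standard deviation. The values of `mu`, `sigma` and `x` are arbitrary choices for illustration.

```r
mu <- 1
sigma <- 2
x <- c(-1, 0, 2.5)

# Density written out term by term from the formula
by_hand <- 1 / (sqrt(2 * pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
all.equal(by_hand, dnorm(x, mean = mu, sd = sigma))
```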

---
# Exponential distribution

`$$f(x~|~\lambda) = \lambda e^{-\lambda x}, ~~ x\geq 0$$`

- right-skewed, unimodal
- `$Exp(\lambda)$`
- Arises in time between or duration of events, e.g. time between successive failures of a machine, duration of a phone call to a help center
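
The exponential density with rate `$\lambda$` is `$\lambda e^{-\lambda x}$`; the leading `$\lambda$` is what makes the area under the curve equal to 1. A sketch with an arbitrary rate and seed:

```r
lambda <- 2                                   # arbitrary rate
x <- c(0, 0.5, 1, 3)

by_hand <- lambda * exp(-lambda * x)          # density from the formula
all.equal(by_hand, dexp(x, rate = lambda))

area <- integrate(dexp, 0, Inf, rate = lambda)$value
area                                          # total area under the density

set.seed(1)                                   # arbitrary seed
m <- mean(rexp(10000, rate = lambda))
m                                             # close to the mean 1 / lambda
```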

---
# Poisson distribution

`$$P(X=x~|~\lambda) = \frac{\lambda^x e^{-\lambda}}{x!} ~~ x \in \{0, 1, 2, ...\}$$`

- discrete, right-skewed, unimodal
- Arises when counting number of times event occurs in an interval of time, e.g. the number of patients arriving in an emergency room between 11 and 12 pm
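
A sketch checking the probability function against `dpois()`. The rate of 4 arrivals per hour is a made-up value for illustration.

```r
lambda <- 4   # hypothetical average number of arrivals in the hour
x <- 0:10

by_hand <- lambda^x * exp(-lambda) / factorial(x)  # formula, term by term
all.equal(by_hand, dpois(x, lambda))

p_at_most_2 <- ppois(2, lambda)  # P(X <= 2): at most two arrivals
p_at_most_2
```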

---
# Binomial

`$$P(X=x~|~n,p) = \left(\begin{array}{c} n \\ x \end{array} \right) p^x (1-p)^{n-x}, ~~ x \in \{0, 1, 2, ..., n\}$$`

- discrete, unimodal; right-skewed, left-skewed or symmetric depending on `$p$`
- Arises from counting the number of successes in `$n$` independent Bernoulli trials, e.g. the number of heads in 10 coin flips
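
The coin-flip example as a sketch in R, assuming a fair coin (`p = 0.5`). The binomial coefficient counts the ways to place `x` successes among `n` trials.

```r
n <- 10
p <- 0.5   # fair coin (assumption)
x <- 0:n

by_hand <- choose(n, x) * p^x * (1 - p)^(n - x)  # formula, term by term
all.equal(by_hand, dbinom(x, size = n, prob = p))

p_five_heads <- dbinom(5, size = n, prob = p)    # P(exactly 5 heads)
p_five_heads
sum(by_hand)                                     # probabilities sum to 1
```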

---
# Pareto

`$$f(x~|~\alpha, \lambda) = \frac{\alpha\lambda^\alpha}{(\lambda+x)^{\alpha+1}}, ~~~x>0, \alpha>0, \lambda>0$$`

- Used to describe allocation of wealth, sizes of human settlement
- Heavier tailed than exponential distribution
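
Base R has no built-in function for this form of the Pareto, so the sketch below writes the density by hand. The name `dpareto2` and the parameter values are assumptions; with `alpha = 3` and `lambda = 2` the mean is `$\lambda/(\alpha-1) = 1$`, which makes an exponential with rate 1 a fair comparison.

```r
# Density from the formula above; dpareto2 is a hypothetical helper name
dpareto2 <- function(x, alpha, lambda) {
  alpha * lambda^alpha / (lambda + x)^(alpha + 1)
}

alpha <- 3
lambda <- 2
area <- integrate(dpareto2, 0, Inf, alpha = alpha, lambda = lambda)$value
area                                                # area under the density

# Tail probabilities P(X > 10) for two distributions with mean 1
pareto_tail <- (lambda / (lambda + 10))^alpha       # Pareto survival function
exp_tail <- pexp(10, rate = 1, lower.tail = FALSE)  # exponential survival
pareto_tail / exp_tail                              # how much heavier the tail is
```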

---
# Weibull

`$$f(x~|~\lambda, k) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^k}, ~~~ x\geq 0$$`

- used for particle size distribution, failure analysis, delivery time, extreme value theory
- shape changes considerably with different `$k$`
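
A sketch checking the Weibull density against `dweibull()`, where `shape` plays the role of `$k$` and `scale` the role of `$\lambda$`. The parameter values are arbitrary. With `$k = 1$` the Weibull reduces to an exponential with rate `$1/\lambda$`.

```r
lambda <- 2    # scale (arbitrary)
k <- 1.5       # shape (arbitrary)
x <- c(0.5, 1, 3)

by_hand <- (k / lambda) * (x / lambda)^(k - 1) * exp(-(x / lambda)^k)
all.equal(by_hand, dweibull(x, shape = k, scale = lambda))

# Special case k = 1: exponential with rate 1 / lambda
all.equal(dweibull(x, shape = 1, scale = lambda), dexp(x, rate = 1 / lambda))
```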

---
# Gamma

`$$f(x~|~\alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1}e^{-x\beta}, ~~~ x\geq 0, ~~\alpha, \beta > 0$$`

- Generalisation of exponential distribution, and also `$\chi^2$`
- `$\alpha$` changes shape substantially
- used to model size of insurance claims, rainfall
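
The two special cases can be checked with `dgamma()`, where `shape` is `$\alpha$` and `rate` is `$\beta$`. The evaluation points and parameter values are arbitrary.

```r
x <- c(0.5, 1, 2, 4)

# alpha = 1 gives the exponential distribution with rate beta
all.equal(dgamma(x, shape = 1, rate = 2), dexp(x, rate = 2))

# alpha = df / 2 and beta = 1 / 2 give the chi-squared distribution
df <- 3
all.equal(dgamma(x, shape = df / 2, rate = 1 / 2), dchisq(x, df = df))
```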

---
# Lognormal

- Also called Galton's distribution
- Arises when `$Y\sim N(\mu, \sigma)$` and we study `$X=exp(Y)$`
- used for modeling the length of comments posted in internet discussion forums, users' dwell time on online articles, size of living tissue, highly communicable epidemics
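
A sketch of the construction: exponentiate normal draws and compare with R's lognormal functions. The parameter values, the cut-off `q` and the seed are arbitrary.

```r
set.seed(1)                          # arbitrary seed
mu <- 0
sigma <- 0.5
y <- rnorm(10000, mean = mu, sd = sigma)
x <- exp(y)                          # lognormal draws

q <- 2
# P(X <= q) equals P(Y <= log(q))
all.equal(plnorm(q, meanlog = mu, sdlog = sigma), pnorm(log(q), mu, sigma))

empirical <- mean(x <= q)            # simulated proportion below q
empirical
plnorm(q, meanlog = mu, sdlog = sigma)
```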

---
# Sampling variability
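
One way to see sampling variability, as a sketch: draw many samples of 20 students from a class with 152 students in 2420 and 36 in 5242, and look at how the estimated proportion changes from sample to sample. The number of repetitions and the seed are arbitrary.

```r
set.seed(2420)                                  # arbitrary seed
class_sections <- rep(c("2420", "5242"), times = c(152, 36))

# Proportion in 2420 for each of 1000 independent samples of size 20
p_hat <- replicate(1000, mean(sample(class_sections, 20) == "2420"))

summary(p_hat)   # the estimates scatter around 152 / 188
sd(p_hat)        # typical size of the sample-to-sample variation
```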

---
# Probability calculations

- Probability density functions are useful for computing probabilities and expected quantities
- E.g. if `$X \sim Gamma(2,1)$`, what is the probability of seeing `$X>3.2$`, or `$1.5<X<2.5$`?

```r
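# Gamma with shape 2 and the default rate of 1
pgamma(3.2, 2, lower.tail=FALSE)   # P(X > 3.2): upper tail
#> [1] 0.17
pgamma(2.5, 2) - pgamma(1.5, 2)    # P(1.5 < X < 2.5): difference of CDF values
#> [1] 0.27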
```
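
These probabilities are areas under the Gamma(2, 1) density, so they can be cross-checked by numerical integration. For shape 2 and rate 1 the upper tail also has the closed form `$(1+x)e^{-x}$`. A sketch, with assumed variable names:

```r
upper <- pgamma(3.2, shape = 2, rate = 1, lower.tail = FALSE)
upper_area <- integrate(dgamma, 3.2, Inf, shape = 2, rate = 1)$value
upper_formula <- (1 + 3.2) * exp(-3.2)       # closed form for shape 2, rate 1
c(upper, upper_area, upper_formula)          # three routes to P(X > 3.2)

middle <- pgamma(2.5, shape = 2, rate = 1) - pgamma(1.5, shape = 2, rate = 1)
middle_area <- integrate(dgamma, 1.5, 2.5, shape = 2, rate = 1)$value
c(middle, middle_area)                       # two routes to P(1.5 < X < 2.5)
```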

---
class: inverse middle 
# Your turn

- Continuous distributions: Area under the curve = ______
- Discrete distributions: Sum of probabilities = ______

---
# Resources

- [NIST Statistics Handbook](http://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm)
- [random.org](https://www.random.org/randomness/)
- [Radioactive decay](https://www.fourmilab.ch/hotbits/)
- [electromagnetic field of a vacuum](https://qrng.anu.edu.au)
- [wikipedia](https://en.wikipedia.org/wiki/List_of_probability_distributions)

---
class: inverse middle 
# Share and share alike

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.