class: center, middle, inverse, title-slide

# Statistical Thinking using Randomisation and Simulation
## Statistical distributions
### Di Cook (dicook@monash.edu, @visnut)
### W3.C1

---

# Overview of this class

- Random numbers
- Mapping random numbers to events for simulation
- Statistical distributions
- Density functions

---

# Random numbers

- True random number generators: [Radioactive decay](https://www.fourmilab.ch/hotbits/), [electromagnetic field of a vacuum](https://qrng.anu.edu.au)
- Computers technically provide only pseudo-random numbers, generated by a deterministic process, e.g. a linear congruential generator, for large `\(a, b, m\)`

`$$X_{n+1} = (aX_n + b) ~~mod ~~m$$`

---

# RANDU - a bad PRNG

- Used in the 60s and onwards

`$$X_{n+1} = 65539 X_n ~~mod ~~2^{31}$$`

<img src="week3.class1_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---

# Mersenne Twister

- algorithm is a twisted generalised feedback shift register (TGFSR)
- based on a Mersenne prime, `\(2^m-1\)`
- most commonly used today
- each integer will occur the same number of times in a period

---

# Using random numbers to estimate things

- Suppose we want to estimate the proportion of the class in 2420 vs 5242
- Now we know that there are 36 5242 and 152 2420 students enrolled in the class. BUT SUPPOSE WE DON'T KNOW THIS!
- We see a random sample of 20 students, and have to guess what the proportion for the whole class is.
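---

# Sketch: a linear congruential generator

The linear congruential recurrence and RANDU from the earlier slides can be tried out directly. This is a minimal sketch, not how R generates random numbers by default; the helper name `lcg` and the seed are assumptions made for illustration. It also checks the known relation `\(X_{n+2} = 6X_{n+1} - 9X_n ~~mod ~~2^{31}\)`, which follows from `\(65539 = 2^{16}+3\)` and is why consecutive RANDU triples fall on a small number of planes.

```r
# Linear congruential generator: X_{n+1} = (a * X_n + b) mod m.
# `lcg` is a hypothetical helper, not part of base R.
lcg <- function(n, seed, a, b, m) {
  # Doubles hold integers exactly only below 2^53
  stopifnot(a * m + b < 2^53)
  x <- numeric(n)
  state <- seed
  for (i in seq_len(n)) {
    state <- (a * state + b) %% m
    x[i] <- state
  }
  x
}

# RANDU: a = 65539, b = 0, m = 2^31; the seed 1 is an arbitrary odd choice
m <- 2^31
randu <- lcg(1000, seed = 1, a = 65539, b = 0, m = m)

# Rescale to [0, 1) to get pseudo-uniform values
u <- randu / m

# Each value is determined by the two before it:
# X_{n+2} - 6 X_{n+1} + 9 X_n is always a multiple of 2^31
n <- length(randu)
lag_check <- (randu[3:n] - 6 * randu[2:(n - 1)] + 9 * randu[1:(n - 2)]) %% m
all(lag_check == 0)
```

Plotting consecutive triples of `u` in three dimensions is one way to see the resulting plane structure.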
---

# Using random numbers

- Random number tables (old fashioned) deliver single digits 0, 1, ..., 9
- When using these you need to ensure that you map these digits or combinations of the digits to match the probabilities of events
- For example, use random numbers to sample students from class
    + There are 188 students in the class
    + Each student, or possible group of students, needs to have an equal chance of being selected
    + Need to use three sequential digits
    + BUT there are 1000 three-digit numbers, so either we throw away 812 of them, or we map each person to multiple numbers (5) and throw away only 60
    + If any person is selected more than once, throw out repeats

---

# Assign every class member a number

```
#> # A tibble: 188 x 4
#>    first     last     section number
#>    <chr>     <chr>    <chr>   <chr>
#>  1 Ahmed     Nehal    2420    000
#>  2 Ang       Xin      2420    001
#>  3 Bradley   Tyler    2420    002
#>  4 Bui       Hoang    2420    003
#>  5 Bui       Nathan   2420    004
#>  6 Bundhoo   Mokshada 2420    005
#>  7 Bundhoo   Urvashi  2420    006
#>  8 Castricum Liam     2420    007
#>  9 Cavanagh  James    2420    008
#> 10 Chan      Ernest   2420    009
#> # ...
with 178 more rows
```

---

# Generate random digits

```
#>   [1] 0 8 0 9 0 8 9 6 5 5 3 0 8 6 3 1 9 9 8 9 6 7 0 2 2 7 6 4 0 2 0 8 4 3
#>  [35] 7 9 4 9 1 6 6 7 7 8 8 9 5 6 6 5 5 7 2 6 1 7 1 2 6 4 4 4 9 6 5 8 9 2
#>  [69] 7 2 7 2 3 4 2 3 7 3 6 8 0 9 3 4 9 9 1 3 5 9 9 8 6 1 5 7 5 9 4 8 9 8
#> [103] 4 3 5 2 2 6 9 7 3 6 7 9 3 0 1 5 7 2 4 7 5 2 5 5 4 3 1 1 5 1 3 4 9 7
#> [137] 8 2 7 3 7 8 8 2 4 8 1 4 4 5 8 1 2 3 5 4 8 8 1 1 7 6 5 2 8 2 3 0 9 6
#> [171] 4 6 6 5 2 5 8 3 4 2 4 9 5 2 5 8 4 2 1 4 0 1 8 5 5 3 3 6 0 6 6 1 6 3
#> [205] 7 4 9 9 3 7 7 4 9 6 6 9 9 2 5 5 9 9 5 4 1 7 8 9 0 4 3 3 8 5 5 6 4 6
#> [239] 0 8 2 4 0 5 2 1 1 9 4 4 2 5 8 3 5 8 9 2 9 1 1 8 6 4 8 3 6 7 3 1 1 5
#> [273] 8 9 1 5 9 3 0 1 9 8 7 5 0 1 6 5 9 2 6 2 5 9 3 5 8 5 3 8 7 6 1 9 4 0
#> [307] 5 7 9 9 3 5 5 3 2 7 1 9 8 9 2 1 9 1 6 0 7 7 5 9 9 4 3 7 7 3 8 9 0 2
#> [341] 6 6 6 0 3 1 9 7 6 4 3 7 9 1 6 5 2 0 2 7 7 8 8 0 0 6 5 2 7 0 6 7 4 8
#> [375] 7 7 4 1 7 3 4 2 7 7 1 0 0 0 3 8 1 8 8 2 8 5 5 0 0 7 2 9 1 2 9 8 4 7
#> [409] 1 2 4 8 3 4 6 1 9 4 4 0 0 3 0 1 0 1 1 2 9 5 0 6 9 4 7 2 0 3 6 1 4 5
#> [443] 1 0 4 7 7 2 8 9 5 3 9 9 8 4 8 5 3 7 6 6 5 2 0 1 0 2 2 3 1 0 4 1 3 0
#> [477] 1 7 7 2 1 2 4 0 8 6 8 0 9 8 2 3 8 9 5 1 3 2 8 6 1 5 9 5 2 7 2 8 7 8
#> [511] 9 7 8 3 5 9 3 3 2 4 5 6 0 0 4 0 5 9 3 5 3 5 3 6 5 9 7 6 3 3 9 0 0 3
#> [545] 4 6 2 3 3 4 8 5 8 4 7 0 2 8 9 9 1 5 5 6 2 1 0 6 9 4 2 0 3 3 9 8 2 2
#> [579] 5 9 3 9 3 9 9 0 3 6 3 7 6 1 3 2 2 8 3 0 1 6 7 2 1 2 2 2 7 7 7 2 9 0
#> [613] 7 8 0 6 9 4 3 9 8 6 7 4 1 0 8 4 7 9 2 1 1 2 6 3 5 0 3 6 0 8 0 4 5 3
#> [647] 1 9 1 6 5 7 2 1 5 4 8 0 1 3 8 9 1 7 5 8 5 7 5 1 2 1 5 7 1 5 7 5 9 0
#> [681] 0 0 3 8 3 0 3 9 8 0 0 9 7 8 2 4 1 1 4 1 4 6 7 6 4 8 0 9 5 7 3 4 0 6
#> [715] 1 9 7 9 3 8 2 1 0 5 9 8 5 9 9 0 0 6 1 8 5 9 9 8 2 3 1 8 0 5 7 3 4 2
#> [749] 9 9 9 0 3 0 2 8 3 8 8 7 7 6 6 2 4 1 7 9 9 6 2 8 5 3 1 0 3 8 7 6 8 2
#> [783] 7 4 1 9 5 7 0 4 1 1 3 1 7 2 1 0 1 9 9 2 4 2 7 3 4 2 8 2 7 5 9 4 1 6
#> [817] 3 7 0 4 1 2 1 5 2 1 0 5 1 9 9 2 3 1 4 2 6 0 2 9 2 5 0 4 0 8 2 4 1 6
#> [851] 5 2 8 1 0 4 2
7 6 0 4 7 1 7 9 6 2 0 8 4 5 5 4 9 3 6 6 4 1 8 9 9 3 9
#> [885] 4 0 6 6 3 5 7 7 2 3 1 2 8 0 6 3 2 5 9 9 8 3 7 6 9 0 4 2 1 5 4 7 9 5
#> [919] 0 5 7 6 0 7 6 1 0 0 3 6 9 5 2 8 4 9 4 2 1 6 7 5 8 2 3 6 8 3 4 1 4 2
#> [953] 2 7 7 3 4 2 6 3 3 2 2 9 9 2 0 4 7 4 6 2 4 6 5 2 0 8 1 5 7 2 2 3 8 6
#> [987] 1 3 1 8 1 0 2 6 8 9 1 8 0 0 5 2
```

---

# Group in threes

```
#>       [,1] [,2] [,3]
#>  [1,]    0    8    0
#>  [2,]    9    0    8
#>  [3,]    9    6    5
#>  [4,]    5    3    0
#>  [5,]    8    6    3
#>  [6,]    1    9    9
#>  [7,]    8    9    6
#>  [8,]    7    0    2
#>  [9,]    2    7    6
#> [10,]    4    0    2
#> [11,]    0    8    4
#> [12,]    3    7    9
#> [13,]    4    9    1
#> [14,]    6    6    7
#> [15,]    7    8    8
#> [16,]    9    5    6
#> [17,]    6    5    5
#> [18,]    7    2    6
#> [19,]    1    7    1
#> [20,]    2    6    4
```

---

# Throw away numbers > 187

```
#>  [1]  80  84 171 157 151 176  66 163  43 158 165  27   6  67 100  38 129
#> [18] 124  30 101
```

---

# Find class members

```
#> # A tibble: 20 x 4
#>    first     last       section number
#>    <chr>     <chr>      <chr>   <chr>
#>  1 Moulding  Ryan       2420    080
#>  2 Nguyen    Jason      2420    084
#>  3 Soo       Matthew    5242    171
#>  4 Freitas   Filho      5242    157
#>  5 Zhou      Jinghao    2420    151
#>  6 Vu        Thi        5242    176
#>  7 Liu       Yucheng    2420    066
#>  8 Lau       Vincent    5242    163
#>  9 Ingram    Timothy    2420    043
#> 10 Gunasena  Geema      5242    158
#> 11 Lim       Zhee       5242    165
#> 12 Gee       Harrison   2420    027
#> 13 Bundhoo   Urvashi    2420    006
#> 14 Liu       Zhaoqi     2420    067
#> 15 Sandhu    Jaskirat   2420    100
#> 16 Hewa      Atapattuge 2420    038
#> 17 Vuong     Jone       2420    129
#> 18 Tong      Zhengqing  2420    124
#> 19 Grewal    Sahil      2420    030
#> 20 Schmierer Corey      2420    101
```

---

# Compute proportion

Estimated proportion is:

```
#> [1] 0.7
```

True proportion is 152/188=0.81.
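---

# Sketch: the digit-mapping steps in R

The steps on the preceding slides (generate digits, group in threes, throw away numbers > 187, drop repeats, look up sections) can be sketched as below. This is an illustration under stated assumptions, not the code that produced the slides: the seed is arbitrary, and the mapping of numbers 000-151 to 2420 and 152-187 to 5242 is inferred from the enrolment counts (152 and 36) and the sample shown earlier.

```r
set.seed(3)  # arbitrary seed, used only to make the sketch reproducible

n_class <- 188   # students numbered 000, 001, ..., 187
n_sample <- 20

# 1. Generate single random digits 0-9
digits <- sample(0:9, 3000, replace = TRUE)

# 2. Group in threes and read each row as a three-digit number
triples <- matrix(digits, ncol = 3, byrow = TRUE)
numbers <- as.vector(triples %*% c(100, 10, 1))

# 3. Throw away numbers > 187, drop repeats, keep the first 20
valid <- unique(numbers[numbers < n_class])
stopifnot(length(valid) >= n_sample)
chosen <- valid[seq_len(n_sample)]

# 4. Look up sections. Assumption: numbers 000-151 are the 152 students
#    in 2420 and numbers 152-187 are the 36 students in 5242.
section <- ifelse(chosen <= 151, "2420", "5242")

# 5. Estimated and true proportion of the class in 2420
est <- mean(section == "2420")
truth <- 152 / n_class
c(estimate = est, truth = truth)
```

Changing the seed gives a different sample and a different estimate, which is the sampling variability the estimate is subject to.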
---

# Simpler approach

```r
class_all %>% sample_n(20)
#> # A tibble: 20 x 3
#>    first      last      section
#>    <chr>      <chr>     <chr>
#>  1 La         Gia       2420
#>  2 Lu         Junrong   2420
#>  3 Liu        Yucheng   2420
#>  4 Wu         Xiaoxiao  5242
#>  5 Prathivadi Pranay    2420
#>  6 Kim        Yejin     2420
#>  7 Tjoaquin   Calista   2420
#>  8 Tran       Minh      5242
#>  9 Mao        Haoyu     2420
#> 10 Soares     Stefan    2420
#> 11 Lai        Benjamin  2420
#> 12 Zhang      Hui       2420
#> 13 Varghese   Adarsh    2420
#> 14 Miao       Yupeng    2420
#> 15 Scott      Ridley    2420
#> 16 Zheng      Jianxiang 2420
#> 17 Yang       Huiyi     2420
#> 18 Soenarto   Cristofer 2420
#> 19 Lao        Tommy     2420
#> 20 Jackson    Danielle  2420
```

---

# Statistical distributions

- Uniform
- Normal
- Exponential
- Poisson
- Binomial
- Pareto
- Weibull
- Gamma
- Lognormal

---

# Random numbers = Uniform

<img src="week3.class1_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

- symmetric, unimodal, uniform
- e.g. `\(U\{0, ..., 9\}\)`
- e.g. `\(P(X=x) = f(x) = 1/10, ~~ x \in \{0, ..., 9\}\)`

---

# Normal distribution

- Gaussian, bell-shaped
- symmetric, unimodal
- `\(N(\mu, \sigma)\)`

`$$f(x~|~\mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}, ~~~ -\infty<x<\infty$$`

<img src="week3.class1_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

# Exponential distribution

`$$f(x~|~\lambda) = \lambda e^{-\lambda x}, ~~ x\geq 0$$`

- right-skewed, unimodal
- `\(Exp(\lambda)\)`
- Arises as the time between, or duration of, events, e.g. time between successive failures of a machine, duration of a phone call to a help center

<img src="week3.class1_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

# Poisson distribution

`$$P(X=x~|~\lambda) = \frac{\lambda^x e^{-\lambda}}{x!} ~~ x \in \{0, 1, 2, ...\}$$`

- discrete, right-skewed, unimodal
- Arises when counting the number of times an event occurs in an interval of time, e.g.
the number of patients arriving in an emergency room between 11 and 12 pm

<img src="week3.class1_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---

# Binomial

`$$P(X=x~|~n,p) = \left(\begin{array}{c} n \\ x \end{array} \right) p^x (1-p)^{n-x} ~~ x \in \{0, 1, 2, ..., n\}$$`

- discrete, unimodal, right- or left-skewed or symmetric depending on `\(p\)`
- Arises from counting the number of successes in `\(n\)` independent Bernoulli trials, e.g. the number of heads in 10 coin flips

<img src="week3.class1_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---

# Pareto

$$ f(x~|~\alpha, \lambda) = \frac{\alpha\lambda^\alpha}{(\lambda+x)^{\alpha+1}} ~~~x>0, \alpha>0, \lambda>0 $$

- Used to describe allocation of wealth, sizes of human settlements
- Heavier tailed than the exponential distribution

<img src="week3.class1_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

# Weibull

`$$f(x~|~\lambda, k) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^k}, ~~~ x\geq 0$$`

- used for particle size distribution, failure analysis, delivery time, extreme value theory
- shape changes considerably with different `\(k\)`

<img src="week3.class1_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

# Gamma

$$f(x~|~\alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1}e^{-x\beta}, ~~~ x\geq 0 ~~\alpha, \beta > 0 $$

- Generalisation of the exponential distribution, and also of `\(\chi^2\)`
- `\(\alpha\)` changes shape substantially
- used to model size of insurance claims, rainfall

<img src="week3.class1_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

# Lognormal

- Also called Galton's distribution
- Arises when `\(Y\sim N(\mu, \sigma)\)` and we study `\(X=exp(Y)\)`
- used for modeling length of comments posted in internet discussion forums, users' dwell time on online articles, size of
living tissue, highly communicable epidemics

<img src="week3.class1_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

---

# Sampling variability

<img src="week3.class1_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />

---

# Probability calculations

- Probability density functions are useful for computing probabilities and expected quantities
- E.g. for Gamma(2,1), what is the probability of seeing `\(X>3.2\)`, or `\(1.5<X<2.5\)`?

```r
pgamma(3.2, 2, lower.tail=FALSE)
#> [1] 0.17
pgamma(2.5, 2) - pgamma(1.5, 2)
#> [1] 0.27
```

<img src="week3.class1_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

---
class: inverse middle

# Your turn

- Continuous distributions: Area under the curve = ______
- Discrete distributions: Sum of probabilities = ______

---

# Resources

- [NIST Statistics Handbook](http://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm)
- [random.org](https://www.random.org/randomness/)
- [Radioactive decay](https://www.fourmilab.ch/hotbits/)
- [electromagnetic field of a vacuum](https://qrng.anu.edu.au)
- [wikipedia](https://en.wikipedia.org/wiki/List_of_probability_distributions)

---
class: inverse middle

# Share and share alike

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.