class: center, middle, inverse, title-slide

# Statistical Thinking using Randomisation and Simulation
## Fitting Models
### Di Cook (<a href="mailto:dicook@monash.edu">dicook@monash.edu</a>, <span class="citation">@visnut</span>)
### W4.C1

---

# Overview of this class

- Fitting a distribution to Olympic medal tallies

---
# Olympic medals, 2012 London

---
# Data

- Extracted from [https://www.olympic.org/london-2012](https://www.olympic.org/london-2012)
- Now it is easier to pull data from [Wikipedia](https://en.wikipedia.org/wiki/2012_Summer_Olympics_medal_table)
- 204 countries participated; only the 85 countries that won at least one medal are listed in the medal table

---
# Medal tally

- Examine the distribution of medal counts
- Need to add 119 zeros (204 - 85) to account for participating countries that did not win a medal
- Distribution is heavily right-skewed and unimodal
- Use maximum likelihood to estimate parameters for plausible distributions
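
The zero-padding step can be sketched as follows. The vector `medal_totals` is a simulated placeholder, not the real 2012 tallies; only the counts 204 and 85 come from the slides.

```r
# Placeholder totals for the 85 medal-winning countries (simulated, NOT real data)
set.seed(2012)
medal_totals <- rgeom(85, prob = 0.1) + 1   # every listed country has at least 1 medal

n_countries <- 204                          # participating countries (from the slides)
n_zeros <- n_countries - length(medal_totals)

# Append one zero for every participating country without a medal
all_totals <- c(medal_totals, rep(0, n_zeros))

n_zeros                 # 119
length(all_totals)      # 204
mean(all_totals == 0)   # share of countries with no medal
```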

---
# Fit distribution using Poisson

```
#>   lambda 
#>   4.72   
#>  (0.15)
```
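
For a Poisson model the maximum likelihood estimate of lambda is the sample mean, and its approximate standard error is `sqrt(lambda / n)`. Assuming the fit above used all 204 countries (zeros included), the reported standard error can be reproduced from the reported estimate. The same sketch shows why a Poisson is hard to reconcile with a heavy right tail.

```r
lambda_hat <- 4.72   # estimate reported on the slide
n <- 204             # participating countries, zeros included (assumption)

# Approximate standard error of the Poisson MLE
se_lambda <- sqrt(lambda_hat / n)
round(se_lambda, 2)
#> [1] 0.15

# Probability of more than 50 medals under Poisson(4.72): effectively zero
p_tail <- ppois(50, lambda_hat, lower.tail = FALSE)
p_tail < 1e-20
#> [1] TRUE
```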

---
# Try lognormal

```
#>   meanlog    sdlog 
#>   0.779     1.137  
#>  (0.080)   (0.056)
```

<img src="week4.class1_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
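
The lognormal MLEs have a closed form: `meanlog` is the mean of `log(x)` and `sdlog` is the divide-by-n standard deviation of `log(x)`. Because `log(0)` is `-Inf`, zero tallies cannot be used directly; how the zeros were handled in the fit above is not shown in the slides. The sketch below uses a hypothetical helper `lnorm_mle()` on simulated positive values to illustrate the formulas, then checks that the reported standard errors are consistent with n = 204.

```r
# Hypothetical helper: closed-form lognormal MLE for strictly positive x
lnorm_mle <- function(x) {
  stopifnot(all(x > 0))
  lx <- log(x)
  n <- length(lx)
  meanlog <- mean(lx)
  sdlog <- sqrt(mean((lx - meanlog)^2))
  c(meanlog = meanlog, sdlog = sdlog,
    se_meanlog = sdlog / sqrt(n), se_sdlog = sdlog / sqrt(2 * n))
}

# Simulated positive data, for illustration only
set.seed(1)
x <- rlnorm(5000, meanlog = 0.8, sdlog = 1.1)
est <- lnorm_mle(x)
round(est, 3)

# Standard errors implied by the reported sdlog = 1.137 with n = 204
round(1.137 / sqrt(c(204, 2 * 204)), 3)
#> [1] 0.080 0.056
```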
 
---
# Try Weibull

```
#>    shape    scale 
#>   0.707    4.106  
#>  (0.033)  (0.434)
```

<img src="week4.class1_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
 
---
# Try Pareto

```
#>     c  
#>   1.28 
#>  (0.09)
```
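
For a Pareto distribution with minimum value 1 and CDF `1 - 1/x^c` (the form used by `ppareto()` in these slides), the MLE has a closed form, `c = n / sum(log(x))`, with approximate standard error `c / sqrt(n)`. The helper `pareto_mle()` below is hypothetical and the data are simulated; it is a sketch of the formula, not necessarily how the estimate above was obtained.

```r
# Hypothetical helper: closed-form MLE for a Pareto with minimum 1 and shape c
pareto_mle <- function(x) {
  stopifnot(all(x >= 1))
  n <- length(x)
  c_hat <- n / sum(log(x))
  c(c = c_hat, se = c_hat / sqrt(n))
}

# Simulate Pareto(c = 0.96) draws by inverting the CDF; illustration only
set.seed(1)
u <- runif(10000)
x <- (1 - u)^(-1 / 0.96)
fit <- pareto_mle(x)
round(fit, 3)

# Standard error implied by the reported c = 1.28 with n = 204
round(1.28 / sqrt(204), 2)
#> [1] 0.09
```

The closed form equals `1 / mean(log(x))`. Since 1/0.779 is about 1.28, the reported lognormal `meanlog` and Pareto `c` agree with this formula if both were fitted to the same values.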

---
# Optimization actually fails

---
# Manually

With the optimization failing, the parameter is set by hand: `\(c=0.96\)`.

---
# Predict largest medal count

Using this model, what is the probability of observing a tally of more than 50 medals for a country? `\(P(X>50)\)`

```r
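# CDF of a Pareto distribution with minimum value 1 and shape c
ppareto <- function(q, c) {
  if (any(c <= 0)) stop("c must be positive")
  ifelse(q < 1, 0, 1 - 1/q^c)
}
# Upper-tail probability P(X > 50)
1 - ppareto(50, 0.96)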
#> [1] 0.023
```

---
# How many would we expect?

If there are 204 countries, how many of them would we expect to earn more than 50 medals, assuming the `\(Pareto(0.96)\)` model?

```r
204*(1-ppareto(50, 0.96))
#> [1] 4.8
```

How does this compare to the observed number?

```r
library(dplyr)
df %>% filter(Total>50)
#>   Total
#> 1    65
#> 2    82
#> 3    88
#> 4   104
```
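
One way to judge the gap between the 4 observed countries and the expected 4.8: if the 204 tallies are treated as independent draws from the model (a simplifying assumption), the number of countries above 50 medals follows a Binomial(204, p) distribution with `p = P(X > 50)`. A minimal sketch:

```r
p <- 50^(-0.96)   # P(X > 50) under Pareto(0.96), same as 1 - ppareto(50, 0.96)
n <- 204          # participating countries

expected <- n * p                   # about 4.8
sd_count <- sqrt(n * p * (1 - p))   # binomial standard deviation

observed <- 4                       # countries above 50 medals in the output above
round(c(expected = expected, sd = sd_count), 1)

# Probability of seeing 4 or fewer such countries under the model
p_le_obs <- pbinom(observed, n, p)
p_le_obs
```

An observed count within about one standard deviation of the expected count gives no evidence against the model in this part of the tail.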

---
# How well does this fit 2008 medal tally?

---
# And 2004?

---
# Doping in sports - finding anomalies

![](athletics-women.png)

Source: FT research, image extracted from [http://blogs.ft.com/ftdata/2015/11/16/doping-in-athletics/](http://blogs.ft.com/ftdata/2015/11/16/doping-in-athletics/)

---
# YOUR TURN: How could we improve the model?

---
#

- What dependencies are there in the medal tallies?
- What varies among Olympic years?
- What factors might affect the medal counts?

---
# Resources

- [2012 Medal tally](https://en.wikipedia.org/wiki/2012_Summer_Olympics_medal_table)
- [2008 Medal tally](https://en.wikipedia.org/wiki/2008_Summer_Olympics_medal_table)
- [2004 Medal tally](https://en.wikipedia.org/wiki/2004_Summer_Olympics_medal_table)
- [Doping in athletics (FT Data blog)](http://blogs.ft.com/ftdata/2015/11/16/doping-in-athletics/)

---
class: inverse middle 
# Share and share alike

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.