Statistical Thinking using Randomisation and Simulation

class: center, middle, inverse, title-slide

# Statistical Thinking using Randomisation and Simulation
## Multilevel Models
### Di Cook (<a href="mailto:dicook@monash.edu">dicook@monash.edu</a>, <span class="citation">@visnut</span>)
### W7.C1

---

# Overview of this class

- Fixed effects vs random effects
- Mixed effects models
- Diagnostics

---
# What is a multilevel model?

- Observations are not independent, but belong to a hierarchy 
- Example: individual level demographics (age, gender), and school level information (location, cours offerings, classroom resources)
- Multilevel model enables fitting different types of dependencies

---
# Fixed vs random

- `Fixed effects` can be used when you know all the categories, e.g. age, gender, smoking status
- `Random effects` are used when not all groups are captured, and we have a random selection of the groups, e.g. individuals (if you have multiple measurements), schools, hospitals

---
# Mixed effects models - a type of multilevel model

For data organized in `$g$` groups, consider a continuous response linear mixed-effects model (LME model) for each group `$i$`, `$i=1, \ldots, g$`:

`$$\underset{(n_i \times 1)}{{\bf y}_i} =  \underset{(n_i \times p)}{{\bf X}_i} \underset{(p \times 1)}{{\boldsymbol \beta}} + \underset{(n_i \times q)}{{\bf Z}_i} \underset{(q \times 1)}{{\bf b}_i} + \underset{(n_i \times 1)}{{\bf \varepsilon}_i}$$`

- `${\bf y}_i$` is the vector of outcomes for the `$n_i$` level-1 units in group `$i$`
- `${\bf X}_i$` and `${\bf Z}_i$` are design matrices for the fixed and random effects
- `${\boldsymbol \beta}$` is a vector of `$p$` fixed effects governing the global mean structure
- `${\bf b}_i$` is a vector of `$q$` random effects for between-group covariance 
- `${\bf \varepsilon}_i$` is a vector of level-1 error terms for within-group covariance

---
# Example

- Data: `$radon$`, 919 owner-occupied homes in 85 counties of Minnesota. Available in the `HLMdiag` package
- Response: `$log.radon$`
- Fixed: `$storey$` (categorical)
- Covariate: `$uranium$` (quantitative)
- Random: `$county$` (house is a member of county)

```
## Observations: 919
## Variables: 5
## $ log.radon   <dbl> 0.7885, 0.7885, 1.0647, 0.0000, 1.1314, 0.9163, 0....
## $ storey      <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ uranium     <dbl> -0.689, -0.689, -0.689, -0.689, -0.847, -0.847, -0...
## $ county      <int> 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ county.name <fctr> AITKIN, AITKIN, AITKIN, AITKIN, ANOKA, ANOKA, ANO...
```

---
# Take a look

Plot of response vs covariate. What do you see?

---
# Here's what we see

- Vertical stripes: each county is represented by an average uranium value
- Weak linear association, lots of variation for houses within county
- Four points inline horizontally at the base (be suspicious)
- Some counties only have 2, 3 points
- Scales?

---
# Pre-processing

- Counties with less than 4 observations removed
- Four flat-line observations removed, really suspect these were erroneously coded missing values

---
# Look again

---
# Fit a simple model

`$$log.radon = \beta_0 + \beta_1 storey + \beta_2 uranium + \varepsilon$$`

```
## 
## Call:
## glm(formula = log.radon ~ storey + uranium, data = radon_sub)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6610  -0.4928   0.0191   0.4745   2.4205  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.4483     0.0313   46.25  < 2e-16 ***
## storeyfirst floor  -0.6112     0.0733   -8.34  3.3e-16 ***
## uranium             0.8359     0.0742   11.26  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.547)
## 
##     Null deviance: 538.51  on 795  degrees of freedom
## Residual deviance: 434.02  on 793  degrees of freedom
## AIC: 1784
## 
## Number of Fisher Scoring iterations: 2
```

---
class: inverse middle 
# Your turn

- What is the intercept?
- What is the slope?
- What does the coefficient labelled `storeyfirst floor` mean?
- Make a sketch of what this model looks like. 
- Does the model match the pattern observed in the data?

---
# Fit an interaction term

```
## 
## Call:
## glm(formula = log.radon ~ storey * uranium, data = radon_sub)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6445  -0.4898   0.0131   0.4653   2.4369  
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 1.4580     0.0318   45.91  < 2e-16 ***
## storeyfirst floor          -0.6659     0.0796   -8.37  2.7e-16 ***
## uranium                     0.8909     0.0805   11.07  < 2e-16 ***
## storeyfirst floor:uranium  -0.3620     0.2066   -1.75     0.08 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.546)
## 
##     Null deviance: 538.51  on 795  degrees of freedom
## Residual deviance: 432.34  on 792  degrees of freedom
## AIC: 1783
## 
## Number of Fisher Scoring iterations: 2
```

---
# What does this model look like?

---
class: inverse middle 
# Your turn

Write down the equation of the fitted model.

--
If story is basement, then 
`$\hat{y}=$` 1.458 `$+$` 0.891 `$\times$` uranium

and if story is first floor, then 
`$\hat{y}=$` 0.792 `$+$` 0.529 `$\times$` uranium

---
# Mixed effects model

`$$log.radon_{ij} = \beta_0 + \beta_1 storey_{ij} + \beta_2 uranium_i + b_{0i} + b_{1i} storey_{ij} + \varepsilon_{ij}$$`

`$$~~~ i=1, ..., \#counties; j=1, ..., n_i$$`

```r
library(lme4)
radon_lmer <- lmer(log.radon ~ storey + uranium + 
  (storey | county.name), data = radon_sub)
summary(radon_lmer)
radon_lmer_fit <- augment(radon_lmer) 
```

---
class: inverse middle 
# Your turn

For the radon data:

- What is `$p$` (number of fixed effects), `$q$` (number of random effects), `$g$` (number of groups)?
- And hence `$n_i, i=1, \dots, g$`?

`$$log.radon_{ij} = \beta_0 + \beta_1 storey_{ij} + \beta_2 uranium_i + b_{0i} + b_{1i} storey_{ij} + \varepsilon_{ij}$$`

`$$~~~ i=1, ..., \#counties; j=1, ..., n_i$$`

---
# Examining the model output: fixed effects

```
Fixed effects:
                    Estimate Std. Error t value
(Intercept)          1.48066    0.03856   38.40
storeyfirst floor   -0.59011    0.11246   -5.25
uranium              0.84600    0.09532    8.88
```

How do these compare with the simple linear model estimates?

```
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.44830    0.03131  46.254  < 2e-16 ***
storeyfirst floor -0.61125    0.07332  -8.337 3.35e-16 ***
uranium            0.83591    0.07422  11.262  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

---
# Examining the model output: random effects

```
Random effects:
 Groups      Name                Variance Std.Dev. Corr
 county.name (Intercept)         0.01388  0.1178       
             storeyfirst floor   0.22941  0.4790   0.02
 Residual                        0.50694  0.7120       
Number of obs: 796, groups:  county.name, 46
```

This is saying that the variance of the estimates for first floor observations is larger than the storey.

---
# What it looks like

---
# Or like this

---
class: inverse middle 
# Your turn

How does the mixed effects model differ from the simple linear model? (Hint: Think about the variance.)

---
# Assumptions

Recall:

- `${\bf b}_i$` is a random sample from `$\mathcal{N}({\bf 0}, {\bf D})$` and independent from the level-1 error terms,  
- `${\boldsymbol \varepsilon}_i$` follow a `$\mathcal{N}({\bf 0},\sigma^2 {\bf R}_i)$` distribution 
- `${\bf D}$` is a positive-definite `$q \times q$` covariance matrix and `${\bf R}_i$` is a positive-definite `$n_i \times n_i$` covariance matrix

---
# Extract and examine level-1 residuals

`${\boldsymbol \varepsilon}_i \sim \mathcal{N}({\bf 0},\sigma^2 {\bf R}_i)$`

Level-1 (observation level) look normal.

---

![](mixed_effects.pdf)

---
# QQ-plot

Level-1 (observation level) do look nearly normal.

---
# Examine within group

Summary statistics

```
## # A tibble: 20 x 4
##    county.name        m     s     n
##         <fctr>    <dbl> <dbl> <int>
##  1       ANOKA  0.05096 0.719    52
##  2    BELTRAMI  0.33508 0.867     7
##  3  BLUE EARTH  0.15188 0.562    14
##  4     CARLTON -0.19433 0.651    10
##  5      CARVER  0.32231 0.924     5
##  6        CASS  0.38263 0.504     5
##  7     CHISAGO  0.15472 0.845     6
##  8        CLAY  0.16252 0.885    14
##  9   CROW WING  0.04541 0.679    12
## 10      DAKOTA -0.04153 0.717    63
## 11     DOUGLAS  0.07803 0.329     9
## 12   FARIBAULT -0.42081 0.311     5
## 13    FREEBORN  0.31357 0.723     9
## 14     GOODHUE  0.15537 0.719    14
## 15    HENNEPIN -0.00737 0.662   105
## 16     HOUSTON -0.13553 0.483     6
## 17     HUBBARD -0.02903 0.712     5
## 18      ITASCA  0.00484 0.584    11
## 19     JACKSON  0.24493 0.689     5
## 20 KOOCHICHING -0.10696 0.466     7
```

---

---
# Learn

There is some difference on average between counties, which means that residuals still have some structure related to the county location.

---
# Normality tests

Anderson-Darling, Cramer-von Mises, Lilliefors (Kolmogorov-Smirnov)

```
## 
## 	Anderson-Darling normality test
## 
## data:  radon_lmer_fit$resid1
## A = 0.4, p-value = 0.4
```

all believe that the residuals are consistent with normality.

---
# Conclusion about level-1 residuals

The assumption:

`$${\boldsymbol \varepsilon}_i \sim \mathcal{N}({\bf 0},\sigma^2 {\bf R}_i)$$`

is probably ok, at the worst it is not badly violated.

---
# Random effects

`$${\bf b}_i \sim \mathcal{N}({\bf 0}, {\bf D}), ~~~ i=1, \dots g$$`

where `${\bf D}$` allows for correlation between random effects within group, and these should be independent from the level-1 error

We have both intercepts (basement) and slopes (first floor)

---

---

---
# Should be no correlation

---
# Fitted vs Observed

Plotting the observed vs fitted values, gives a sense for how much of the response is explained by the model. Here we can see that there is still a lot of unexplained variation.

---
# Goodness of fit

From the linear model

```
##   null.deviance df.null logLik  AIC  BIC deviance df.residual
## 1           539     795   -887 1783 1806      432         792
```

From the random effects model

```
##   sigma logLik  AIC  BIC deviance df.residual
## 1 0.712   -885 1784 1817     1760         789
```

Hmmm... deviance looks strange! Compute sum of squares of residuals instead:

```
## [1] 387
```

Which model is best?

---
# Influence

No overly influential observations

---
# Resources

- [HLMDiag package explanation](https://www.jstatsoft.org/article/view/v056i05)
- [HLM package](https://cran.r-project.org/web/packages/HLMdiag/index.html)

---
class: inverse middle 
# Share and share alike

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.