---
title: "ETC 2420/5242 Lab 6 2017"
author: "Di Cook"
date: "Week 6"
output: pdf_document
---
```{r echo = FALSE, message = FALSE, warning = FALSE, warning = FALSE}
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
error = FALSE,
echo = FALSE,
collapse = TRUE,
comment = "#",
fig.height = 4,
fig.width = 8,
fig.align = "center",
cache = FALSE
)
library(knitr)
```
## Purpose
This lab is to practice fitting and diagnosing multiple linear regression models.
## Reading
- The web site [OECD PISA](http://www.oecd.org/education/singapore-tops-latest-oecd-pisa-global-education-survey.htm) has a lot of information about the data. Click on the `TTest your skills online by answering some PISA 2016 science questions` and do some of the questions that students had to answer. How many did you get right out of how many attempted? ______
- Read the material on fitting multiple regression models in [Statistics online textbook, Diez, Barr, Cetinkaya-Rundel](https://www.openintro.org/stat/textbook.php?stat_book=isrs).
- Read the code in the lecture notes on diagnostics for linear models from Week 5.
## Warmup exercises
- We are going to take a look at the OECD PISA 2015 data focusing on Australia.
```{r echo=TRUE}
library("tidyverse")
library("forcats")
load("pisa_au.rda")
```
There are ten values for each student for the science score. The explanation for why this is, is long, but long story short, the raw scores that a student earns in the test are not distributed, but rather a large linear model is constructed, and ten predictions are randomly generated for each student from the model. Below is a scatterplot matrix plot of the plausible scores for each student of Australia. You can see that the scores are pretty similar across the variables, because the correlation is high and the scatter is strongly linear.
```{r echo=TRUE, eval=FALSE}
library("GGally")
sci_scores <- pisa_au %>% select(PV1SCIE, PV2SCIE, PV3SCIE, PV4SCIE,
PV5SCIE, PV6SCIE, PV7SCIE, PV8SCIE,
PV9SCIE, PV10SCIE) %>% as.data.frame()
ggscatmat(sci_scores, alpha=0.1)
```
We will create a new variable which is the average of the ten scores for each student.
```{r echo=TRUE}
pisa_au <- pisa_au %>%
mutate(science = (PV1SCIE+PV2SCIE+PV3SCIE+PV4SCIE+
PV5SCIE+PV6SCIE+PV7SCIE+PV8SCIE+
PV9SCIE+PV10SCIE)/10)
```
Students are tested at many different schools. How many schools? And what is the distribution of number of students tested at each school?
```{r echo=TRUE}
pisa_au %>% group_by(CNTSCHID) %>%
tally() %>%
arrange(desc(n)) -> aus_schools
dim(aus_schools)
ggplot(aus_schools, aes(x=n)) +
geom_density(fill="black", alpha=0.5)
```
A dictionary of variables that we will use further (in addition to the `science` variable we just created) is as follows:
\begin{tabular}{llp{3in}}
Variable name & Description & Coding\\\hline
ST004D01T & Gender & 1=Female, 2=Male\\
OUTHOURS & Out-of-School Study Time \\
ANXTEST & Personality: Test Anxiety (WLE) \\
EMOSUPP & Parental emotional support (WLE) \\
PARED & Index highest parental education in years of schooling\\
JOYSCIE & Enjoyment of science (WLE)\\
WEALTH & Family wealth (WLE) \\
ST013Q01TA & How many books are there in your home? 1 0-10 books,
2 11-25 books,
3 26-100 books,
4 101-200 books,
5 201-500 books,
6 More than 500 books\\
ST012Q01TA & How many in your home: Televisions; 1 None,
2 One,
3 Two,
4 Three or more\\
SENWT & Weight & Reflects how the student represents other students in Australia based on socioeconomic and demographic characteristics
\end{tabular}
Subset the data to contain just these variables.
```{r echo=TRUE}
pisa_au <- pisa_au %>%
select(science, ST004D01T, OUTHOURS, ANXTEST, EMOSUPP, PARED, JOYSCIE, WEALTH, ST013Q01TA, ST012Q01TA, SENWT)
```
Make summaries each of the variables, to examine their suitability for modeling.
```{r echo=TRUE}
summary(pisa_au)
```
- EMOSUPP is all missing, cannot use this variable.
- OUTHOURS has about one third missing, might be unreliable
- JOYSCIE has 10% missing, might be unreliable
Actions to take:
- Drop EMOSUPP
- Remove any case with missings - and check numbers remaining
```{r echo=TRUE}
pisa_au <- pisa_au %>% select(-EMOSUPP)
aus_nomiss <- pisa_au %>% filter(!is.na(OUTHOURS)) %>%
filter(!is.na(ANXTEST)) %>% filter(!is.na(PARED)) %>%
filter(!is.na(JOYSCIE)) %>% filter(!is.na(WEALTH)) %>%
filter(!is.na(ST013Q01TA)) %>% filter(!is.na(ST012Q01TA))
```
The number of students (observations) drops from `r nrow(pisa_au)` to `r nrow(aus_nomiss)` about 5000. That's a lot. If we ignored OUTHOURS there would be a loss of much less data, about 2000 records.
The students all have weights associated with them. This is an indication of how many other students they represent in Australia, relative to their socioeconomic and demographic characteristics. Let's look at the distribution of weights
```{r}
ggplot(aus_nomiss, aes(x=SENWT)) + geom_density(fill="black", alpha=0.5)
```
There is a lot of variation in the weights. The weights are bimodal (is the bimodality due to one of the variables in the study that we are using for the model? Its not due to gender!) with a few very large ones. It looks like we will need to take weight into account in the model.
```{r}
ggplot(aus_nomiss, aes(x=SENWT)) +
facet_wrap(~ST004D01T) +
geom_density(fill="black", alpha=0.5)
```
Model building will be done using:
- Response: `science` (standardised)
- Explanatory variables: all the remaining variables
Some variables need to be treated as categorical variables, so it is best if they are forced to be factors before modeling:
```{r echo=TRUE}
aus_nomiss <- aus_nomiss %>%
mutate(ST004D01T=factor(ST004D01T, levels=c(1,2), labels=c("f", "m")))
aus_nomiss <- aus_nomiss %>%
mutate(science_std = (science-mean(science))/sd(science))
```
Test the model fitting, by fitting a model for science against gender, and joy of science.
```{r echo=TRUE}
aus_glm_test <- glm(science_std~ST004D01T+JOYSCIE,
data=aus_nomiss, weights=SENWT)
summary(aus_glm_test)
```
Sketch what this model looks like.
## Question 1
- Make plots of the response variable `science_std` against each of the possible explanatory variables.
- Which variables look like they should be most important for predicting the response?
## Question 2
- Fit the weighted multiple regression model to all the explanatory variables.
```{r eval=FALSE}
aus_glm <- glm(science_std~ST004D01T+OUTHOURS+ANXTEST+PARED+JOYSCIE+WEALTH+ST013Q01TA+ST012Q01TA, data=aus_nomiss, weights=SENWT)
summary(aus_glm)
```
- Summarise the coefficients for the model fit.
- Not all variables are significant in the model. What variables can be dropped? Re-fit the model with this subset.
```{r eval=FALSE}
aus_glm <- glm(science_std~ST004D01T + OUTHOURS + ANXTEST +
PARED + JOYSCIE + WEALTH + ST013Q01TA + ST012Q01TA,
data=aus_nomiss, weights=SENWT)
summary(aus_glm)
```
## Question 3
- Compute the leverage and influence statistics.
```{r eval=FALSE}
library("broom")
aus_glm_augment <- augment(aus_glm)
ggplot(aus_glm_augment, aes(x=.hat, y=.cooksd)) + geom_point()
```
- What value would be considered to be the cutoff for considering a case to have high leverage?
- How many cases have high influence?
## Question 4
- Plot the observed vs fitted values. How good is the model for predicting science score? (Is it weak, moderate or strong?)
```{r eval=FALSE}
ggplot(aus_glm_augment, aes(x=.fitted, y=science_std)) + geom_point()
```
- Plot residuals vs fitted. What do you learn about the model fit by looking at this plot?
```{r eval=FALSE}
ggplot(aus_glm_augment, aes(x=.fitted, y=.resid)) + geom_point()
```
- Make a histogram of residuals, and a qqplot (normal probability plot). Do these look like a sample from a normal model?
```{r eval=FALSE}
p1 <- ggplot(aus_glm_augment, aes(x=.resid)) + geom_histogram()
n <- nrow(aus_nomiss)
aus_glm_augment$q = qnorm(c(1 - 0.5^(1/n), (2:(n-1) - 0.3175) /
(n + 0.365),0.5^(1/n)), 2, 0.5)
p2 <- ggplot(aus_glm_augment, aes(x=sort(.resid), y=q)) + geom_point() +
geom_smooth(method="lm", se=FALSE) + theme(aspect.ratio=1)
library("gridExtra")
grid.arrange(p1, p2, ncol=2)
```
## Question 5
Compute the variance inflation factors. Do these indicate collinearity between predictors that needs to be addressed?
```{r eval=FALSE}
library("car")
vif(aus_glm)
```
## Question 6
Interpret the model, by answering these questions:
- For boys how much do science scores increase or decrease on average, in relative to girls, keeping everything else fixed?
- For each additional hour spent studying how much does science score increase on average, keeping everything else fixed?
- For a household with more than 100 books does the average science score change, keeping all else fixed?
## Question 7
This plot shows `science_std` plotted against JOYSCIE separately by ST004D01T (gender). Is there evidence that an interaction term should be fitted to the model? Explain.
```{r}
ggplot(aus_nomiss, aes(x=JOYSCIE, y=science_std, colour=ST004D01T)) +
#geom_point(alpha=0.2) +
geom_smooth(method="lm")
```
## Question 8
Find the best model for science scores, which could include interaction terms, and justify your choice.
## TURN IN
- Your `.Rmd` file
- Your `.html` file that results from knitting the Rmd.
- Make sure your group members are listed as authors, one person per group will turn in the report
## Resources
- [Statistics online textbook, Diez, Barr, Cetinkaya-Rundel](https://www.openintro.org/stat/textbook.php?stat_book=isrs).