---
title: "ETC 2420/5242 Lab 5 2017"
author: "Di Cook"
date: "Week 5"
output: html_document
---

```{r echo = FALSE, message = FALSE, warning = FALSE, warning = FALSE}
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  error = FALSE, 
  echo = TRUE, 
  collapse = TRUE,
  comment = "#",
  fig.height = 4,
  fig.width = 8,
  fig.align = "center",
  cache = FALSE
)
library(knitr)
```

## Purpose

This lab is to practice fitting linear models. 

## Reading

- Read the material on fitting regression models in [Statistics online textbook, Diez, Barr, Cetinkaya-Rundel](https://www.openintro.org/stat/textbook.php?stat_book=isrs). 

- Read the code in the lecture notes of the second class from Week 4. 
- Watch the movie [Hans Rosling's TED talk](https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen?language=en)

## Warmup exercises

- We are going to take a look at the gapminder data used in the first lab.
- The data has demographics of life expectancy and GDP per capita for 142 countries reported every 5 years between 1952 and 2007.

```{r echo=TRUE, fig.align='center', fig.width=7, fig.height=6}
library("tidyverse")
library("gapminder")
glimpse(gapminder)
```

- How would you describe the following plot?

```{r echo=TRUE, fig.align='center', fig.width=6, fig.height=4}
ggplot(data=gapminder, aes(x=year, y=lifeExp, group=country)) +
  geom_line(alpha=0.5)
```

- 1950 is the first year, so for model fitting we are going to shift year to begin in 1950, makes interpretability easier.

```{r echo=TRUE}
gapminder2 <- gapminder %>% mutate(year1950 = year-1950)
```

- Then let's fit a model for Australia

```{r echo=TRUE, fig.align='center', fig.width=6, fig.height=4}
oz <- gapminder2 %>% filter(country=="Australia")
head(oz)
ggplot(data=oz, aes(x=year, y=lifeExp)) + 
  geom_point() + 
  geom_smooth(method="lm", se=FALSE)
oz_lm <- lm(lifeExp~year1950, data=oz)
oz_lm
```

- Interpret the model. (This means explain how life expectancy changes over years, since 1950, using the parameter estimates of the model.)

- What was the average life expectancy in 1950?

- What was the average life expectancy in 2000?

```{r}
coef <- coefficients(oz_lm)
coef[1] + coef[2] * 50
```


- By how much did average life expectancy change over those 50 years?

- We can get various diagnostics out for the model with the `broom` package: the parameter estimates and their significance, the goodness of fit statistics, and model diagnostics. 

```{r echo=TRUE, fig.align='center', fig.width=6, fig.height=4}
summary(oz_lm)
library("broom")
oz_coef <- tidy(oz_lm)
oz_coef
oz_fit <- glance(oz_lm)
oz_fit
oz_diag <- augment(oz_lm)
oz_diag
```

- What column of the diagnostics contains the (a) fitted values, (b) residuals?

- Look up what the `.hat` column contains. Write a couple of sentences about the meaning of this statistic. Plot the values for Australia, as a dotplot. They should all be pretty similar because there are no influential points for this model.

```{r}
ggplot(oz_diag, aes(x = .hat)) +
  geom_dotplot(binwidth = 0.02)
```


## Question 1

The plot of all the countries was really hard to explain. It was very messy. Trying to say something about how life expectancy has changed around the globe was hard. 

Now we are going to fit a simple linear model separately to every country. And use the model fits to simplify the patterns across the globe, in order to be able to explain the changes in life expectancy.

This code will compute the models for you:

```{r echo=TRUE}
library("purrr")
by_country <- gapminder2 %>% 
  select(country, year1950, lifeExp, continent) %>%
  group_by(country, continent) %>% 
  nest()
by_country <- by_country %>% 
  mutate(
    model = purrr::map(data, ~ lm(lifeExp ~ year1950, 
                                  data = .))
  )
country_coefs <- by_country %>% 
  unnest(model %>% purrr::map(broom::tidy))
country_coefs <- country_coefs %>% 
  select(country, continent, term, estimate) %>% 
  spread(term, estimate)
kable(head(country_coefs))
country_coefs %>%
  filter(country == "Australia")
```

Here is simpler code using a `for` loop to compute the slope and intercept for each country.

```{r echo=TRUE}
n <- length(table(gapminder2$country))
country_coefs <- data.frame(country=gapminder2$country[seq(1, 1704, 12)],
                   continent=gapminder2$continent[seq(1, 1704, 12)],
                   intercept=rep(0,n), 
                   year1950=rep(0,n))
for (i in 1:n) {
  sub <- gapminder2 %>% filter(country==country_coefs$country[i])
  sub_lm <- lm(lifeExp~year1950, data=sub)
  sub_lm_coefs <- coefficients(sub_lm)
  country_coefs$intercept[i] <- sub_lm_coefs[1]
  country_coefs$year1950[i] <- sub_lm_coefs[2]
}
kable(head(country_coefs))
```

- Pick your favorite country (other than Australia). Find the parameter estimates from the `country_coefs` data frame. Do a hand-sketch of the fitted model. 

```{r}
set.seed(2017)
sample(gapminder$country, size = 4)
```

```{r}
country_coefs %>%
  filter(country == "Japan")
```


## Question 2

a. Make a scatterplot of the linear model estimates for each country, slope vs intercept. Colour the points by continent.

```{r eval=TRUE}
ggplot(country_coefs, aes(x=intercept, y=year1950, colour=continent)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2")
```

b. Statistically summarise the relationship between intercept and slope, using words like no association, positive linear association, negative linear association, weak, moderate, strong, outliers, clusters.

c. Do you see a difference between continents? If so, explain what you see.

d. What does it mean for a country to have a high intercept, e.g. 70?

e. What does it mean for a country to have a high slope, e.g. 0.7?

f. Make the plot interactive using the `plotly` package, and find out which countries had a negative slope. 

```{r eval=TRUE}
ggplot(country_coefs, aes(x=intercept, y=year1950, 
                          colour=continent, label=country)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2")
library(plotly)
ggplotly()
```

## Question 3

Now we are going to examine the fit for each country. We might expect that a linear model is a better fit for some countries and not so good for other countries. Here is the code to extract the model diagnostics for each country's model. 

```{r echo=TRUE}
country_fit <- by_country %>% 
  unnest(model %>% 
           purrr::map(broom::glance))
```

Or you can use a `for` loop to compute this:

```{r echo=TRUE}
n <- length(unique(gapminder2$country))
country_fit <- data.frame(country=gapminder2$country[seq(1, 1704, 12)],
                   continent=gapminder2$continent[seq(1, 1704, 12)],
                   intercept=rep(0,n), 
                   year1950=rep(0,n),
                   r.squared=rep(0,n))
for (i in 1:n) {
  sub <- gapminder2 %>% filter(country==country_fit$country[i])
  sub_lm <- lm(lifeExp~year1950, data=sub)
  sub_lm_fit <- coefficients(sub_lm)
  country_fit$intercept[i] <- sub_lm_coefs[1]
  country_fit$year1950[i] <- sub_lm_coefs[2]
  country_fit$r.squared[i] <- summary(sub_lm)$r.squared
}
kable(head(country_fit))
```

a. Plot the $R^2$ values as a histogram. 
```{r eval=TRUE}
ggplot(country_fit, aes(x=r.squared)) + geom_histogram()
```
b. Statistically describe the distribution using words like symmetric, skewed left, skewed right, unimodal, bimodal, multimodal, outliers. 
c. What do you learn about the model fit across all 142 countries?

## Question 4

a. Examine the countries with the worst fit, countries with $R^2<0.45$, by making scatterplots of the data, with the linear model overlaid. 
```{r eval=TRUE}
badfit <- country_fit %>% filter(r.squared <= 0.45)
gapminder2_sub <- gapminder2 %>% filter(country %in% badfit$country)
ggplot(data=gapminder2_sub, aes(x=year, y=lifeExp)) + 
         geom_point() +
  facet_wrap(~country) +
  scale_x_continuous(breaks=seq(1950,2000,10), 
                     labels=c("1950", "60","70", "80","90","2000")) +
  geom_smooth(method="lm", se=FALSE)
```
b. Each of these countries has a big dip in their life expectancy during the time of the study. Explain these using world history and current affairs information. (Feel free to google for news stories.)
c. Use the statistics to investigate something of your choice related to life expectancy across the world over those 50 years. 


## TURN IN 

- Your `.Rmd` and `.html` file
- Make sure your group members are listed as authors, one person per group will turn in the report

## Resources

- [Statistics online textbook, Diez, Barr, Cetinkaya-Rundel](https://www.openintro.org/stat/textbook.php?stat_book=isrs). 
- Watch the movie [Hans Rosling's TED talk](https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen?language=en)