---
title: "ETC 2420/5242 Lab 5 2017"
author: "Di Cook"
date: "SOLUTION"
output: pdf_document
---
```{r echo = FALSE, message = FALSE, warning = FALSE, warning = FALSE}
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
error = FALSE,
echo = FALSE,
collapse = TRUE,
comment = "#",
fig.height = 4,
fig.width = 8,
fig.align = "center",
cache = FALSE
)
library(knitr)
```
```{r fig.align='center', fig.width=7, fig.height=6}
library("tidyr")
library("dplyr")
library("ggplot2")
library("gapminder")
```
```{r}
gapminder2 <- gapminder %>% mutate(year1950 = year-1950)
```
```{r fig.align='center', fig.width=6, fig.height=4}
library("broom")
```
## Question 1 (1pt)
```{r echo=FALSE}
library("purrr")
by_country <- gapminder2 %>%
select(country, year1950, lifeExp, continent) %>%
group_by(country, continent) %>%
nest()
by_country <- by_country %>%
mutate(
model = purrr::map(data, ~ lm(lifeExp ~ year1950,
data = .))
)
country_coefs <- by_country %>%
unnest(model %>% purrr::map(broom::tidy))
country_coefs <- country_coefs %>%
select(country, continent, term, estimate) %>%
spread(term, estimate)
```
```{r echo=FALSE}
n <- length(table(gapminder2$country))
country_coefs <- data.frame(country=gapminder2$country[seq(1, 1704, 12)],
continent=gapminder2$continent[seq(1, 1704, 12)],
intercept=rep(0,n),
year1950=rep(0,n))
for (i in 1:n) {
sub <- gapminder2 %>% filter(country==country_coefs$country[i])
sub_lm <- lm(lifeExp~year1950, data=sub)
sub_lm_coefs <- coefficients(sub_lm)
country_coefs$intercept[i] <- sub_lm_coefs[1]
country_coefs$year1950[i] <- sub_lm_coefs[2]
}
```
- Pick your favorite country (other than Australia). Find the parameter estimates from the `country_coefs` data frame. Do a hand-sketch of the fitted model. `Various answers, but should show the correct intercept and slope with axes marked and labels.`
## Question 2 (5 pts, none for a)
a. Make a scatterplot of the linear model estimates for each country, slope vs intercept. Colour the points by continent.
```{r echo=TRUE}
ggplot(country_coefs, aes(x=intercept, y=year1950, colour=continent)) +
geom_point()
```
b. Statistically summarise the relationship between intercept and slope, using words like no association, positive linear association, negative linear association, weak, moderate, strong, outliers, clusters. `The association is negative moderate and linear, with some clustering by continent.`
c. Do you see a difference between continents? If so, explain what you see. `Africa shows the lowest intercepts and most variation in the slope. Europe is high on intercept and low on slope. Asia, like the Americas, is varied on intercept but relatively high on slope.`
d. What does it mean for a country to have a high intercept, e.g. 70? `The life expectancy in 1950 was quite high, e.g. 70 years.`
e. What does it mean for a country to have a high slope, e.g. 0.7? `The country had dramatic increase in life expectancy over the years. A value of 0.7 means life expectancy increased by 7 years for every decade.`
f. Make the plot interactive using the `plotly` package, and find out which countries had a negative slope. `Rwanda, Zambia, Zimbabwe`
```{r echo=TRUE}
ggplot(country_coefs, aes(x=intercept, y=year1950,
colour=continent, label=country)) +
geom_point()
```
```{r echo=TRUE, eval=FALSE}
library(plotly)
ggplotly()
```
## Question 3 (2pts, none for a)
```{r echo=FALSE}
country_fit <- by_country %>%
unnest(model %>%
purrr::map(broom::glance))
```
```{r echo=FALSE}
n <- length(unique(gapminder2$country))
country_fit <- data.frame(country=gapminder2$country[seq(1, 1704, 12)],
continent=gapminder2$continent[seq(1, 1704, 12)],
intercept=rep(0,n),
year1950=rep(0,n),
r.squared=rep(0,n))
for (i in 1:n) {
sub <- gapminder2 %>% filter(country==country_fit$country[i])
sub_lm <- lm(lifeExp~year1950, data=sub)
sub_lm_fit <- coefficients(sub_lm)
country_fit$intercept[i] <- sub_lm_coefs[1]
country_fit$year1950[i] <- sub_lm_coefs[2]
country_fit$r.squared[i] <- summary(sub_lm)$r.squared
}
```
a. Plot the $R^2$ values as a histogram.
```{r}
ggplot(country_fit, aes(x=r.squared)) + geom_histogram()
```
b. Statistically describe the distribution using words like symmetric, skewed left, skewed right, unimodal, bimodal, multimodal, outliers. `The distributionn is skewed left, with a few really low values.`
c. What do you learn about the model fit across all 142 countries? `For most countries, the fit is very good. There are some countries where a linear model does not fit at all.`
## Question 4 (2pts, none for a)
a. Examine the countries with the worst fit, countries with $R^2<0.45$, by making scatterplots of the data, with the linear model overlaid.
```{r}
badfit <- country_fit %>% filter(r.squared <= 0.45)
gapminder2_sub <- gapminder2 %>% filter(country %in% badfit$country)
ggplot(data=gapminder2_sub, aes(x=year, y=lifeExp)) +
geom_point() +
facet_wrap(~country) +
scale_x_continuous(breaks=seq(1950,2000,10),
labels=c("1950", "60","70", "80","90","2000")) +
geom_smooth(method="lm", se=FALSE)
```
b. Each of these countries has a big dip in their life expectancy during the time of the study. Explain these using world history and current affairs information. (Feel free to google for news stories.)`Civil wars and AIDS`
c. Use the statistics to investigate something of your choice related to life expectancy across the world over those 50 years. `Various answers, could notice that the top R2 countries have very linear growth in life expectancy, or genocide in Cambodia, or country that had the most dramatic increase,...`