---
title: 'Statistical Thinking using Randomisation and Simulation'
subtitle: 'Compiling data for problem solving'
author: Di Cook (dicook@monash.edu, @visnut)
date: "W10.C1"
output:
  xaringan::moon_reader:
    css: ["default", "myremark.css"]
    self_contained: false
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---

```{r setup, include = FALSE}
library(knitr)
opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  cache = FALSE,
  echo = FALSE,
  fig.align = 'center',
  fig.height = 4,
  fig.width = 4,
  collapse = TRUE,
  comment = "#>"
)
library(tidyverse)
library(gridExtra)
```

# Overview of this class

- What is `tidy data`? Why do you want tidy data? Getting your data into tidy form using tidyr.
- Wrangling verbs: `filter`, `arrange`, `select`, `mutate`, `summarise`, with dplyr
- Date and time with lubridate

---
# Terminology

1. `Cases, records, individuals, subjects, experimental units, examples, instances`: things we are collecting information about
2. `Variables, attributes, fields, features`: what we are measuring on each record/case/.../instance

Generally we think of cases being on the rows, and variables being in the columns of a table. This is a basic data structure. BUT data is often given to us in many other shapes. Getting it into a tidy shape will allow you to use it efficiently for modeling.

---
# Example 1

```{r echo=FALSE}
grad <- read_csv("../data/graduate-programs.csv")
head(grad[1:4,c(2,3,4,6)])
```

- Cases: __________
- Variables: __________

```{r echo=FALSE, eval=FALSE}
- Cases: graduate programs
- Variables: subject, Inst, AvNumPubs, ...
```

---
# Example 2

Data from weather stations available at [NCDC NOAA](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt)

```{r}
melbtemp <- read.fwf("../data/ASN00086282.dly",
   c(11, 4, 2, 4, rep(c(5, 1, 1, 1), 31)), fill = TRUE)
head(melbtemp[1:4,c(1,2,3,4,seq(5,128,4))])
```

- Cases: __________
- Variables: __________

```{r echo=FALSE, eval=FALSE}
- Each row of data provided contains the values for one month!
- Cases: days
- Variables: TMAX, TMIN, PRCP, year, month, day, stationid.
```

---
# Example 3

Here are the column headers from a data set containing information on tuberculosis incidence by country across the globe for the last few decades ...

```{r}
tb <- read_csv("../data/tb.csv")
#tail(tb)
colnames(tb)
```

- Cases: __________
- Variables: __________

---
# Example 4

Data in this form is commonly found on web sites:

```{r}
pew <- read.delim(
  file = "http://stat405.had.co.nz/data/pew.txt",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = F
)
pew[1:5, 1:5]
```

- Cases: __________
- Variables: __________

---
# Example 5

A 10-week sensory experiment: 12 individuals assessed the taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?). The fries were fried in one of 3 different oils, and each assessment was replicated twice. First few rows:

```{r, echo = FALSE}
data(french_fries, package = "reshape2")
kable(head(french_fries, 4), format = "html", row.names = F)
```

What would you like to know?

---

![](french_fries.png)

---
# Messy data patterns

There are various features of messy data that one can observe in practice. Here are some of the more commonly observed patterns.

- Column headers are values, not variable names
- Variables are stored in both rows and columns, contingency table format
- Information stored in multiple tables
- Dates in many different formats
- Not easy to analyse

---
# What is tidy data?

- Each observation forms a row
- Each variable forms a column
- Contained in a single table
- Long form makes it easier to reshape in many different ways
- Wide form is common for analysis/modeling
- This form fits neatly into the statistical thinking framework, where each variable can be treated as a sample from some statistical distribution.

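---
# Tidy data: a small sketch

The rules above can be illustrated on a toy table. This is only a minimal sketch: the `scores_wide` data frame and its values are made up for illustration, and it assumes tidyr 1.0.0 or later, where `pivot_longer()` is the newer interface for what `gather()` does.

```r
library(tidyr)

# Hypothetical messy table: the week number is a value,
# but it is stored in the column headers.
scores_wide <- data.frame(
  subject = c("s1", "s2"),
  week1 = c(5.0, 3.5),
  week2 = c(6.0, 4.0),
  stringsAsFactors = FALSE
)

# Tidy (long) form: one row per subject-week observation,
# one column per variable (subject, week, score).
scores_long <- pivot_longer(
  scores_wide,
  cols = c(week1, week2),
  names_to = "week",
  values_to = "score"
)
scores_long
```

In the long form, `week` is an ordinary column, so it can be filtered, grouped or mapped to a plot aesthetic like any other variable.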
---
# Tidy vs messy

- Tidy data facilitates analysis in many different ways, answering multiple questions, applying methods to new data or other problems
- Messy data may work for one particular problem but is not generalisable

---
# Tidy verbs

- `gather`: specify the `keys` (identifiers) and the `values` (measures) to make long form (used to be called melting)
- `spread`: put variables into columns to make wide form (used to be called casting)
- `nest/unnest`: working with lists
- `separate/unite`: split and combine columns

---
# French fries example

```{r, echo = FALSE}
library(reshape2)
library(tidyr)
head(french_fries)
```

---
# This format is not ideal for data analysis

What code would be needed to plot each of the ratings at `time == 10` in a different color?

```
library(ggplot2)
french_sub <- french_fries[french_fries$time == 10,]
ggplot(data = french_sub) +
  geom_boxplot(aes(x = "1_potato", y = potato), fill = I("red")) +
  geom_boxplot(aes(x = "2_buttery", y = buttery), fill = I("orange")) +
  geom_boxplot(aes(x = "3_grassy", y = grassy), fill = I("yellow")) +
  geom_boxplot(aes(x = "4_rancid", y = rancid), fill = I("green")) +
  geom_boxplot(aes(x = "5_painty", y = painty), fill = I("blue")) +
  xlab("variable") + ylab("rating")
```

---
# The plot

```{r, echo=FALSE, fig.width=6}
library(ggplot2)
french_sub <- french_fries %>% filter(time == 10)
ggplot(data = french_sub) +
  geom_boxplot(aes(x = "1_potato", y = potato), fill = I("red")) +
  geom_boxplot(aes(x = "2_buttery", y = buttery), fill = I("orange")) +
  geom_boxplot(aes(x = "3_grassy", y = grassy), fill = I("yellow")) +
  geom_boxplot(aes(x = "4_rancid", y = rancid), fill = I("green")) +
  geom_boxplot(aes(x = "5_painty", y = painty), fill = I("blue")) +
  xlab("variable") + ylab("rating")
```

---
# Wide to long

![](gather.png)

---
# Gathering

+ When gathering, you need to specify the **keys** (identifiers) and the **values** (measures).
+ Keys/Identifiers:
    - Identify a record (must be unique)
    - Example: Indices on a random variable
    - Fixed by design of experiment (known in advance)
    - May be single or composite (may have one or more variables)
+ Values/Measures:
    - Collected during the experiment (not known in advance)
    - Usually numeric quantities

---
# Gathering the French Fries data

```
ff_long <- gather(french_fries, key = variable, value = rating, potato:painty)
head(ff_long)
```

```{r, echo=F}
ff_long <- gather(french_fries, key = variable, value = rating, potato:painty)
head(ff_long)
```

---
# Let's re-write the code

```
ff_long_sub <- ff_long %>% filter(time == 10)
ggplot(data = ff_long_sub, aes(x = variable, y = rating, fill = variable)) +
  geom_boxplot()
```

---
# And plot it

```{r, echo=FALSE, fig.width=6}
ff_long_sub <- ff_long %>% filter(time == 10)
ggplot(data = ff_long_sub, aes(x = variable, y = rating, fill = variable)) +
  geom_boxplot()
```

---
# Long to wide

In certain applications, we may wish to take a long dataset and convert it to a wide dataset (perhaps for display in a table).

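As a minimal sketch of this long-to-wide step on a toy table: the `ratings_long` data frame and its values are made up for illustration, and the code assumes tidyr 1.0.0 or later, where `pivot_wider()` is the newer interface for this kind of reshaping.

```r
library(tidyr)

# Hypothetical long table: one row per subject-scale rating.
ratings_long <- data.frame(
  subject = c("s1", "s1", "s2", "s2"),
  scale = c("potato", "buttery", "potato", "buttery"),
  rating = c(2.9, 0.0, 14.0, 1.1),
  stringsAsFactors = FALSE
)

# Wide form: one row per subject, one column per rating scale.
ratings_wide <- pivot_wider(
  ratings_long,
  names_from = scale,
  values_from = rating
)
ratings_wide
```

Here `names_from` names the column whose values become the new column headers, and `values_from` names the column that fills the cells.
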
```{r echo=FALSE}
head(ff_long)
```

---
# Spread

We use the **spread** function from tidyr to do this:

```
ff_wide <- spread(ff_long, key = variable, value = rating)
head(ff_wide)
```

```{r echo=FALSE}
ff_wide <- spread(ff_long, key = variable, value = rating)
head(ff_wide)
```

---
# The split-apply-combine approach

- *Split* a dataset into many smaller sub-datasets
- *Apply* some function to each sub-dataset to compute a result
- *Combine* the results of the function calls into one dataset

---
# Split-apply-combine in dplyr

```
library(dplyr)
ff_summary <- ff_long %>% group_by(variable) %>% # SPLIT
  summarise(
    m = mean(rating, na.rm = TRUE),
    s = sd(rating, na.rm = TRUE)) # APPLY + COMBINE
ff_summary
```

```{r echo=FALSE, message=FALSE, error=FALSE}
library(dplyr)
ff_summary <- ff_long %>% group_by(variable) %>% # SPLIT
  summarise(
    m = mean(rating, na.rm = TRUE),
    s = sd(rating, na.rm = TRUE)) # APPLY + COMBINE
ff_summary
```

---
# Pipes

- Pipes have historically been used to build data analysis pipelines
- Pipes allow the code to be read like a sequence of operations
- dplyr allows us to chain together these data analysis tasks using the `%>%` (pipe) operator
- `x %>% f(y)` is shorthand for `f(x, y)`
- Example:

```{r echo=TRUE}
student2012.sub <- readRDS("../data/student_sub.rds")
student2012.sub %>%
  count(CNT)
```

---
# dplyr verbs

There are five primary dplyr `verbs`, representing distinct data analysis tasks:

- `filter`: Keep the rows of a data frame that satisfy a condition, producing subsets
- `arrange`: Reorder the rows of a data frame
- `select`: Select particular columns of a data frame
- `mutate`: Add new columns that are functions of existing columns
- `summarise`: Create collapsed summaries of a data frame

---
# Filter

```
french_fries %>%
  filter(subject == 3, time == 1)
```

```{r echo=FALSE}
french_fries %>%
  filter(subject == 3, time == 1)
```

---
# Arrange

```
french_fries %>%
  arrange(desc(rancid)) %>%
  head
```

```{r echo=FALSE}
french_fries %>%
  arrange(desc(rancid)) %>%
  head
```

---
# Select

```
french_fries %>%
  select(time, treatment, subject, rep, potato) %>%
  head
```

```{r echo=FALSE}
french_fries %>%
  select(time, treatment, subject, rep, potato) %>%
  head
```

---
# Mutate

```
french_fries %>%
  mutate(yucky = grassy + rancid + painty) %>%
  head
```

```{r echo=FALSE}
french_fries %>%
  mutate(yucky = grassy + rancid + painty) %>%
  head
```

---
# Summarise

```
french_fries %>%
  group_by(time, treatment) %>%
  summarise(mean_rancid = mean(rancid), sd_rancid = sd(rancid))
```

```{r echo=FALSE}
french_fries %>%
  group_by(time, treatment) %>%
  summarise(mean_rancid = mean(rancid), sd_rancid = sd(rancid))
```

---
# Dates and times

- Dates are deceptively hard to work with
- 02/05/2012. Is it February 5th, or May 2nd?
- Time zones
- Different starting times of stock markets, airplane departure and arrival

---
# Basic lubridate use

```{r echo=TRUE}
library(lubridate)

now()
now(tz = "America/Chicago")
today()
now() + hours(4)
today() - days(2)
ymd("2013-05-14")
mdy("05/14/2013")
dmy("14052013")
```

---
# Dates example: Oscars date of birth

```{r echo=TRUE}
oscars <- read_csv("../data/oscars.csv")
oscars <- oscars %>% mutate(DOB = mdy(DOB))
head(oscars$DOB)
summary(oscars$DOB)
```

---
# Calculating on dates

- You should never ask a woman her age, but ... really!

```{r echo=TRUE}
oscars <- oscars %>% mutate(year = year(DOB))
summary(oscars$year)
oscars %>% filter(year == 2029) %>%
  select(Name, Sex, DOB)
```

---
# Months

```{r, echo=TRUE}
oscars <- oscars %>% mutate(month = month(DOB, label = TRUE, abbr = TRUE))
table(oscars$month)
```

---
# Now plot it

```{r echo=TRUE, fig.width=8, fig.height=4}
ggplot(data = oscars, aes(month)) + geom_bar()
```

---
# Should you be born in April?

```{r echo=TRUE, fig.width=8, fig.height=4}
df <- data.frame(m = sample(1:12, 423, replace = TRUE))
df$m2 <- factor(df$m, levels = 1:12, labels = month.abb)
ggplot(data = df, aes(x = m2)) + geom_bar()
```

---
# Resources

- [Tidy data](http://vita.had.co.nz/papers/tidy-data.pdf)
- [Split-apply-combine](http://vita.had.co.nz/papers/plyr.pdf)
- [RStudio cheat sheets](https://www.rstudio.com/resources/cheatsheets/)
- [Working with dates and times](https://www.jstatsoft.org/article/view/v040i03/v40i03.pdf)
- [R for Data Science](http://r4ds.had.co.nz)

---
class: inverse middle

# Share and share alike

This work is licensed under a Creative Commons Attribution 4.0 International License.