class: center, middle, inverse, title-slide

# Testing Hypotheses Using Permutation
### Dilini Talagala

---
class: inverse, center, middle

# Data structures

[Learn more on data structures in R - read 'Advanced R' by Hadley Wickham](http://adv-r.had.co.nz/Data-structures.html)

<img src="https://images.tandf.co.uk/common/jackets/amazon/978146658/9781466586963.jpg" width="200px" style="display: block; margin: auto;" />

---
### Data structures

- R's base data structures can be organised by their **dimensionality** (1d, 2d, or nd) and whether they are **homogeneous** or **heterogeneous**.

- Most commonly used data types in data analysis:

.pull-left[
**Homogeneous**
##### *(All contents must be of the same type)*

Atomic vector [1d]

Matrix [2d]

Array [nd]
]

.pull-right[
**Heterogeneous**
##### *(The contents can be of different types)*

List [1d]

Data frame [2d]
]

---
## 1. Vectors

- Vectors in R are either
    - **atomic vectors** or
    - **lists**

---
### 1.1 Atomic vectors

- All elements of an atomic vector must be the same type.

- Common types of atomic vectors:

```r
c(0.5, 0.6, 0.7) ## numeric (double)
```

```
## [1] 0.5 0.6 0.7
```

```r
# With the L suffix, you get an integer rather than a double
c(1L, 2L, 3L) ## integer
```

```
## [1] 1 2 3
```

```r
c(TRUE, FALSE, TRUE) ## logical
```

```
## [1]  TRUE FALSE  TRUE
```

```r
c("a", "b", "c") ## character
```

```
## [1] "a" "b" "c"
```

---
### 1.2 Lists

- Lists are different from atomic vectors because their elements can be of different types, including lists.

```r
x <- list(a = 1:3,
          b = c(TRUE, FALSE, TRUE),
          c = c(2.3, 5.9),
          d = list(y = c(1, 2, 3), z = c("A", "B")))
x
```

```
## $a
## [1] 1 2 3
## 
## $b
## [1]  TRUE FALSE  TRUE
## 
## $c
## [1] 2.3 5.9
## 
## $d
## $d$y
## [1] 1 2 3
## 
## $d$z
## [1] "A" "B"
```

---

```r
x$b
```

```
## [1]  TRUE FALSE  TRUE
```

```r
x$d$z
```

```
## [1] "A" "B"
```

```r
str(x)
```

```
## List of 4
##  $ a: int [1:3] 1 2 3
##  $ b: logi [1:3] TRUE FALSE TRUE
##  $ c: num [1:2] 2.3 5.9
##  $ d:List of 2
##   ..$ y: num [1:3] 1 2 3
##   ..$ z: chr [1:2] "A" "B"
```

---
## 2. Matrices and arrays

- Adding a dim attribute to an atomic vector allows it to behave like a multi-dimensional array.

- A special case of the array is the matrix, which has two dimensions.

- Matrices are common. Arrays are much rarer.

---
### 2.1 Matrix

```r
# Two scalar arguments to specify rows and columns
a <- matrix(1:6, ncol = 3, nrow = 2)
a
```

```
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
```

.pull-left[

```r
a[2, 3] # a[row, column]
```

```
## [1] 6
```

```r
a[ , 3] # third column
```

```
## [1] 5 6
```

```r
a[1, ] # first row
```

```
## [1] 1 3 5
```
]

.pull-right[

```r
is.matrix(a)
```

```
## [1] TRUE
```

```r
is.array(a)
```

```
## [1] TRUE
```
]

---
### 2.2 Array

```r
# One vector argument to describe all dimensions
b <- array(1:12, c(2, 3, 2))
b
```

```
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
```

---
## 3. Data frames

- A data frame is the most common way of storing data in R.

- A few data frames that we are already familiar with: *economics*, *gapminder*

```r
library(dplyr)
data(economics, package = "ggplot2")
glimpse(economics)
```

```
## Observations: 574
## Variables: 6
## $ date     <date> 1967-07-01, 1967-08-01, 1967-09-01, 1967-10-01, 1967...
## $ pce      <dbl> 507.4, 510.5, 516.3, 512.9, 518.1, 525.8, 531.5, 534....
## $ pop      <int> 198712, 198911, 199113, 199311, 199498, 199657, 19980...
## $ psavert  <dbl> 12.5, 12.5, 11.7, 12.5, 12.5, 12.1, 11.7, 12.2, 11.6,...
## $ uempmed  <dbl> 4.5, 4.7, 4.6, 4.9, 4.7, 4.8, 5.1, 4.5, 4.1, 4.6, 4.4...
## $ unemploy <int> 2944, 2945, 2958, 3143, 3066, 3018, 2878, 3001, 2877,...
```

---

```r
data(gapminder, package = "gapminder")
glimpse(gapminder)
```

```
## Observations: 1,704
## Variables: 6
## $ country   <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
```

---
class: inverse, center, middle

# Managing data frames with the <br/> dplyr package

<img src="https://d21ii91i3y6o6h.cloudfront.net/gallery_images/from_proof/9295/large/1447175226/rstudio-hex-dplyr-dot-psd.png" width="200px" style="display: block; margin: auto;" />

---
### Managing data frames with the dplyr package

- [Learn more on 'Managing data frames with the dplyr package' - read 'R Programming for Data Science' by Roger D.
Peng](https://bookdown.org/rdpeng/rprogdatascience/managing-data-frames-with-the-dplyr-package.html)

- Some of the key "verbs" provided by the dplyr package are

    + **select()**: return a subset of the columns of a data frame

    + **filter()**: extract a subset of rows from a data frame

    + **arrange()**: reorder rows of a data frame

    + **rename()**: rename variables in a data frame

    + **mutate()**: add new variables/columns or transform existing variables

    + **group_by()**: take an existing tbl and convert it into a grouped tbl

    + **summarise()**: generate summary statistics of different variables in the data frame, possibly within groups

---
class: inverse, center, middle

# %>% operator

<img src="https://d21ii91i3y6o6h.cloudfront.net/gallery_images/from_proof/9302/large/1447173978/rstudio-hex-pipe-dot-psd.png" width="200px" style="display: block; margin: auto;" />

#### *"Ceci n'est pas une pipe" - (This is not a pipe)*

---
## %>% operator

- **%>%**: the "pipe" operator is used to connect multiple functions in a sequence of operations.

#### Format: *second_fun( first_fun(x) )*

- Nested calls are difficult to read because the sequence of operations runs from the inside out

```r
summarise(group_by(gapminder, continent), max = max(lifeExp))
```

```
## # A tibble: 5 x 2
##   continent    max
##      <fctr>  <dbl>
## 1    Africa 76.442
## 2  Americas 80.653
## 3      Asia 82.603
## 4    Europe 81.757
## 5   Oceania 81.235
```

---

- The %>% operator makes code more readable

- Code reads more naturally, in a left-to-right fashion.

#### Format: *x %>% first_fun() %>% second_fun()*

```r
gapminder %>%
  group_by(continent) %>%
  summarise(max = max(lifeExp))
```

```
## # A tibble: 5 x 2
##   continent    max
##      <fctr>  <dbl>
## 1    Africa 76.442
## 2  Americas 80.653
## 3      Asia 82.603
## 4    Europe 81.757
## 5   Oceania 81.235
```

- At each step of a %>% pipeline, the output of the previous function is passed as the first argument of the next function.
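
A minimal sketch of this equivalence, using a made-up numeric vector (the names `x`, `nested` and `piped` are illustrative and not part of the slides): the nested form and the piped form should return the same value.

```r
library(dplyr) # dplyr re-exports the %>% operator from magrittr

# Illustrative input, not from the slides
x <- c(4, 9, 16)

# Nested form: second_fun(first_fun(x)), read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped form: each result becomes the first argument of the next call,
# so round(1) here means round(<previous result>, 1)
piped <- x %>% sqrt() %>% mean() %>% round(1)

# Both forms compute the same thing
identical(nested, piped)
```

Extra arguments, such as the `1` in `round(1)`, are simply placed after the piped-in first argument.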

---
class: inverse, center, middle

# Creating data frames with the <br/> tibble package

<img src="http://hexb.in/hexagons/tibble.png" width="200px" style="display: block; margin: auto;" />

---
## Creating a data frame with the tibble package

- To learn more about tibbles, read
    + ['R for Data Science' by Garrett Grolemund and Hadley Wickham](http://r4ds.had.co.nz/tibbles.html)
    + [RStudio blog](https://blog.rstudio.com/2016/08/29/tibble-1-2-0/)

```r
vignette("tibble")
```

- A data frame can be created using tibble().

```r
library(tibble)
df <- tibble(x = 1:3, y = 3:1)
df
```

```
## # A tibble: 3 x 2
##       x     y
##   <int> <int>
## 1     1     3
## 2     2     2
## 3     3     1
```

---

```r
# The add_row() / add_column() functions allow
# control over where the new rows/columns are added
df %>%
  add_row(x = 4, y = 0, .before = 2)
```

```
## # A tibble: 4 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     3
## 2     4     0
## 3     2     2
## 4     3     1
```

```r
df %>%
  add_column(z = -1:1, .after = "x")
```

```
## # A tibble: 3 x 3
##       x     z     y
##   <int> <int> <int>
## 1     1    -1     3
## 2     2     0     2
## 3     3     1     1
```

---
### Subsetting

.pull-left[

```r
# Extract by name
df$x
```

```
## [1] 1 2 3
```

```r
df[["x"]]
```

```
## [1] 1 2 3
```

```r
# Extract by position
df[[1]]
```

```
## [1] 1 2 3
```
]

.pull-right[

```r
# To use in a pipe, use
# the special placeholder .:
df %>% .$x
```

```
## [1] 1 2 3
```

```r
df %>% .[["x"]]
```

```
## [1] 1 2 3
```
]

---

```r
yawn_expt <- tibble(group = c(rep("control", 16), rep("treatment", 34)),
                    yawn = c(rep("no", 12), rep("yes", 4),
                             rep("no", 24), rep("yes", 10)))
```

---
Let's take a look at the data frame we created

.pull-left[

```r
# Print out the first few rows
head(yawn_expt)
```

```
## # A tibble: 6 x 2
##     group  yawn
##     <chr> <chr>
## 1 control    no
## 2 control    no
## 3 control    no
## 4 control    no
## 5 control    no
## 6 control    no
```

```r
# Get a glimpse of your data.
glimpse(yawn_expt)
```

```
## Observations: 50
## Variables: 2
## $ group <chr> "control", "control", "control", "control", "control", "...
## $ yawn  <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "n...
```
]

.pull-right[

```r
# Print out the last few rows
tail(yawn_expt)
```

```
## # A tibble: 6 x 2
##       group  yawn
##       <chr> <chr>
## 1 treatment   yes
## 2 treatment   yes
## 3 treatment   yes
## 4 treatment   yes
## 5 treatment   yes
## 6 treatment   yes
```
]

---
## Creating a contingency table from a data frame

```r
library(dplyr)
library(tidyr)
library(knitr)
yawn_expt %>%
  group_by(group, yawn) %>%
  tally() %>%
  ungroup() %>%
  spread(yawn, n) %>%
  mutate(total = rowSums(.[-1]))
```

```
## # A tibble: 2 x 4
##       group    no   yes total
##       <chr> <int> <int> <dbl>
## 1   control    12     4    16
## 2 treatment    24    10    34
```

--

#### Your turn

Compute the proportion of the treatment and control groups who yawned. Add this to the table.

---
## Permutation Test

```r
prop_dif <- function(data){
  dtbl <- data %>%
    mutate(yawn = sample(yawn)) # Permute the yawn variable

  # Your turn to compute the difference
  # between proportions of treatment and control groups

  return(pdif)
}
```

---
## Setting the random number seed

- Setting the random number seed with set.seed() ensures reproducibility of the sequence of random numbers.
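
The same holds for `sample()`, which `prop_dif()` uses to permute the `yawn` column. A small sketch, assuming a toy vector `yawn_toy` and an arbitrary seed value (neither comes from the experiment data): resetting the seed reproduces the same permutation, and a permutation only reorders the values, so the "no" and "yes" counts are unchanged.

```r
# Toy vector for illustration only (not the yawn_expt data)
yawn_toy <- c(rep("no", 6), rep("yes", 4))

set.seed(2017)             # arbitrary seed value
perm1 <- sample(yawn_toy)  # one random permutation

set.seed(2017)             # reset to the same seed
perm2 <- sample(yawn_toy)  # reproduces the same permutation

identical(perm1, perm2)                 # same order both times
identical(sort(perm1), sort(yawn_toy))  # same values, only reordered
```
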

Compare the outputs of the following commands:

```r
set.seed(100)
rnorm(5)
```

```
## [1] -0.50219235  0.13153117 -0.07891709  0.88678481  0.11697127
```

```r
rnorm(5)
```

```
## [1]  0.3186301 -0.5817907  0.7145327 -0.8252594 -0.3598621
```

```r
set.seed(100)
rnorm(5)
```

```
## [1] -0.50219235  0.13153117 -0.07891709  0.88678481  0.11697127
```

---
### Run the function 10000 times, saving the results

```r
set.seed(444)
# here we create an empty numeric vector of
# length 10000 to store our results
pdif <- numeric(10000)
## Your turn to write the for-loop
```

---
class: inverse, center, middle

# Plotting with ggplot2

<img src="https://d21ii91i3y6o6h.cloudfront.net/gallery_images/from_proof/9296/large/1447173871/rstudio-hex-ggplot2-dot-psd.png" width="200px" style="display: block; margin: auto;" />

---
### Histogram

```r
library(ggplot2)
# 'economics' is the name of the data frame and
# it has a variable called 'pce'.
ggplot(data = economics, aes(x = pce)) +
  geom_histogram(binwidth = 500, colour = "blue", fill = "lightblue") +
  geom_vline(xintercept = 10000, colour = "red")
```

![](index_files/figure-html/unnamed-chunk-27-1.png)<!-- -->

- *binwidth is the width of the histogram bins*

---
### Your turn

1. Make a histogram of the results.
2. Draw a vertical line on the plot that represents the difference for the actual data.

```r
pdif <- data.frame(pdif)
# your turn to use ggplot to produce the histogram
```

---
class: inverse, center, middle

Most of the material I've used here is based on

['Advanced R' by Hadley Wickham](http://adv-r.had.co.nz/)

['R Programming for Data Science' by Roger D. Peng](https://bookdown.org/rdpeng/rprogdatascience/)

['R for Data Science' by Garrett Grolemund and Hadley Wickham](http://r4ds.had.co.nz/)

# Happy learning with R :)