dplyr package

```{r, out.width = "200px", echo=FALSE, fig.align='center'}
knitr::include_graphics("https://d21ii91i3y6o6h.cloudfront.net/gallery_images/from_proof/9295/large/1447175226/rstudio-hex-dplyr-dot-psd.png")
```

---

### Managing data frames with the dplyr package

- [Learn more on 'Managing data frames with the dplyr package' - read 'R Programming for Data Science' by Roger D. Peng](https://bookdown.org/rdpeng/rprogdatascience/managing-data-frames-with-the-dplyr-package.html)

- Some of the key "verbs" provided by the dplyr package are

    + **select()**: return a subset of the columns of a data frame
    + **filter()**: extract a subset of rows from a data frame
    + **arrange()**: reorder rows of a data frame
    + **rename()**: rename variables in a data frame
    + **mutate()**: add new variables/columns or transform existing variables
    + **group_by()**: takes an existing tbl and converts it into a grouped tbl
    + **summarise()**: generate summary statistics of different variables in the data frame, possibly within groups

---
class: inverse, center, middle

# %>% operator

```{r, out.width = "200px", echo=FALSE, fig.align='center'}
knitr::include_graphics("https://d21ii91i3y6o6h.cloudfront.net/gallery_images/from_proof/9302/large/1447173978/rstudio-hex-pipe-dot-psd.png")
```

#### *"Ceci n'est pas une pipe" - (This is not a pipe)*

---

## %>% operator

- **%>%**: the "pipe" operator is used to connect multiple functions in a sequence of operations.

#### Format: *second_fun( first_fun(x) )*

- Nested calls make a sequence of operations difficult to read

```{r }
summarise(group_by(gapminder, continent), max = max(lifeExp))
```

---

- The %>% operator makes code more readable
- The code reads more naturally, from left to right.

#### Format: *x %>% first_fun() %>% second_fun()*

```{r }
gapminder %>%
  group_by(continent) %>%
  summarise(max = max(lifeExp))
```

- At each step of the pipeline, the output of the previous function is passed as the first argument of the next function.
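---

The two formats above can be compared side by side. This is a minimal sketch that uses the built-in `mtcars` data instead of `gapminder`, so it runs without extra data packages; the names `nested`, `piped` and `mean_mpg` are chosen only for illustration.

```{r}
library(dplyr)

# x %>% f(y) is evaluated as f(x, y):
# the left-hand side becomes the first argument
c(1, 4, 9) %>% sqrt() %>% sum()   # same as sum(sqrt(c(1, 4, 9)))

# Nested form: read from the inside out
nested <- summarise(group_by(mtcars, cyl), mean_mpg = mean(mpg))

# Piped form: read from left to right
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

# Both forms compute the same group means
identical(nested$mean_mpg, piped$mean_mpg)
```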
---
class: inverse, center, middle

# Creating data frames with the tibble package

```{r, out.width = "200px", echo=FALSE, fig.align='center'}
knitr::include_graphics("http://hexb.in/hexagons/tibble.png")
```

---

## Creating a data frame with the tibble package

- To learn more about tibbles, read
    + ['R for Data Science' by Garrett Grolemund and Hadley Wickham](http://r4ds.had.co.nz/tibbles.html)
    + [RStudio blog](https://blog.rstudio.com/2016/08/29/tibble-1-2-0/)

```{r eval= FALSE}
vignette("tibble")
```

- A data frame can be created using tibble().

```{r}
library(tibble)
df <- tibble(x = 1:3, y = 3:1)
df
```

---

```{r}
# The add_row() / add_column() functions allow
# control over where the new rows/columns are added
df %>% add_row(x = 4, y = 0, .before = 2)
df %>% add_column(z = -1:1, .after = "x")
```

---

### Subsetting

.pull-left[
```{r}
# Extract by name
df$x
df[["x"]]

# Extract by position
df[[1]]
```
]

.pull-right[
```{r}
# To use in a pipe, use
# the special placeholder .:
df %>% .$x
df %>% .[["x"]]
```
]

---

```{r}
yawn_expt <- tibble(group = c(rep("control", 16), rep("treatment", 34)),
                    yawn = c(rep("no", 12), rep("yes", 4),
                             rep("no", 24), rep("yes", 10)))
```

---

Let's take a look at the data frame we created

.pull-left[
```{r}
# Print out the first few rows
head(yawn_expt)
# Get a glimpse of your data.
glimpse(yawn_expt)
```
]

.pull-right[
```{r}
# Print out the last few rows
tail(yawn_expt)
```
]

---

## Creating a contingency table from a data frame

```{r message = FALSE, warning = FALSE, error = FALSE}
library(dplyr)
library(tidyr)
library(knitr)
yawn_expt %>%
  group_by(group, yawn) %>%
  tally() %>%
  ungroup() %>%
  spread(yawn, n) %>%
  mutate(total = rowSums(.[-1]))
```

--

#### Your turn

Compute the proportion of the treatment and control groups who yawned. Add this to the table.
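---

A possible hint for the exercise above, sketched on a small made-up table rather than the yawning data: once the counts sit in columns, `mutate()` can add a proportion column. The tibble `toy` and the columns `total` and `prop_yes` are hypothetical and chosen only for illustration.

```{r}
library(tibble)
library(dplyr)

# Hypothetical table of counts (not the yawning data)
toy <- tibble(group = c("a", "b"),
              no    = c(6, 3),
              yes   = c(2, 9))

toy_prop <- toy %>%
  mutate(total = no + yes,         # row totals
         prop_yes = yes / total)   # proportion of "yes" in each group
toy_prop
```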
---

## Permutation Test

```{r}
prop_dif <- function(data){
  dtbl <- data %>%
    mutate(yawn = sample(yawn)) # Permute the yawn variable

  # Your turn to compute the difference
  # between the proportions of the treatment and control groups

  return(pdif)
}
```

---

## Setting the random number seed

- Setting the random number seed with set.seed() ensures reproducibility of the sequence of random numbers. Compare the outputs of the following commands:

```{r}
set.seed(100)
rnorm(5)
rnorm(5)
set.seed(100)
rnorm(5)
```

---

### Run the function 10000 times, saving the results

```{r}
set.seed(444)
# here we create a numeric vector of length 10000
# (filled with zeros) to store our results
pdif <- numeric(10000)

## Your turn to write the for-loop
```

---
class: inverse, center, middle

# Plotting with ggplot2

```{r, out.width = "200px", echo=FALSE, fig.align='center'}
knitr::include_graphics("https://d21ii91i3y6o6h.cloudfront.net/gallery_images/from_proof/9296/large/1447173871/rstudio-hex-ggplot2-dot-psd.png")
```

---

### Histogram

```{r fig.width = 15, fig.height = 5 }
library(ggplot2)
# 'economics' is the name of the data frame and
# it has a variable called 'pce'.
ggplot(data = economics, aes(x = pce)) +
  geom_histogram(binwidth = 500, colour = "blue", fill = "lightblue") +
  geom_vline(xintercept = 10000, colour = "red")
```

- *binwidth is the width of the histogram bins*

---

### Your turn

1. Make a histogram of the results.
2. Draw a vertical line on the plot that represents the difference for the actual data.

```{r eval = FALSE}
pdif <- data.frame(pdif)
# your turn to use ggplot to produce the histogram
```

---
class: inverse, center, middle

Most of the material I've used here is based on

['Advanced R' by Hadley Wickham](http://adv-r.had.co.nz/)

['R Programming for Data Science' by Roger D. Peng](https://bookdown.org/rdpeng/rprogdatascience/)

['R for Data Science' by Garrett Grolemund and Hadley Wickham](http://r4ds.had.co.nz/)

# Happy learning with R :)