class: center, middle, inverse, title-slide

# Statistical Thinking using Randomisation and Simulation
## Compiling data for problem solving
### Di Cook (dicook@monash.edu, @visnut)
### W10.C2

---
# Overview of this class

- Reading different `data formats`
- Handling missing data
- String operations, working with `text`
- Grammar of graphics
- Randomisation to assess structure perceived in plots

---
# Reading different file formats: shapefiles

The Australian Electoral Commission publishes the boundaries of the electorates on its website at [http://www.aec.gov.au/Electorates/gis/gis_datadownload.htm](http://www.aec.gov.au/Electorates/gis/gis_datadownload.htm).

Once the files (preferably the national files) are downloaded, unzip the file (it will build a folder with a set of files). We want to read the shapes contained in the `shp` file into R.

---

```r
library(maptools)

# shapeFile contains the path to the shp file:
shapeFile <- "../data/vic-esri-24122010/vic 24122010.shp"
sF <- readShapeSpatial(shapeFile)
class(sF)
#> [1] "SpatialPolygonsDataFrame"
#> attr(,"package")
#> [1] "sp"
```

---

`sF` is a spatial data frame containing all of the polygons. We use the `rmapshaper` package, available from ateucher's GitHub page, to thin the polygons while preserving the geography:

```r
library(rmapshaper)
```

```r
sFsmall <- ms_simplify(sF, keep=0.05) # use instead of thinnedSpatialPoly
```

---

`keep` indicates the proportion of points we want to keep in the polygons. Keeping 5% leaves the electorate boundaries still quite recognisable, but reduces the overall size of the map considerably, making it faster to plot.

---

We can use base graphics to plot this map:

```r
plot(sFsmall)
```

<img src="week10.class2_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---
# Extracting the electorate information

A spatial polygons data frame consists of both a data set with information on each of the entities (in this case, electorates), and a set of polygons for each electorate (sometimes multiple polygons are needed, e.g. if the electorate has islands). We want to extract both of these parts.

---

```r
nat_data <- sF@data
head(nat_data)
#>   GEODB_OID OBJECTID DIV_NUMBER ELECT_DIV NUMCCDS ACTUAL PROJECTED
#> 0         1        1          1     Aston     190  92370     98469
#> 1         2        2          2  Ballarat     274  95003    100786
#> 2         3        3          3    Batman     265  96909    104258
#> 3         4        4          4   Bendigo     284  95729    102582
#> 4         5        5          5     Bruce     226  95472     99904
#> 5         6        6          6   Calwell     214  99031    104734
#>   POPULATION OVER_18 AREA_SQKM SORTNAME                MAP_SYMBOL MAP_TYPE
#> 0          0       0   98.9337    Aston Final Divisional Boundary    Metro
#> 1          0       0 4651.6400 Ballarat Final Divisional Boundary    Rural
#> 2          0       0   65.6887   Batman Final Divisional Boundary    Metro
#> 3          0       0 6255.0000  Bendigo Final Divisional Boundary    Rural
#> 4          0       0   72.6900    Bruce Final Divisional Boundary    Metro
#> 5          0       0  174.7130  Calwell Final Divisional Boundary    Metro
#>      LENGTH SHAPE_AREA
#> 0  58422.89   99056594
#> 1 417555.12 4652896627
#> 2  58226.02   65717035
#> 3 622901.96 6255411672
#> 4  48696.38   72629548
#> 5  81925.61  174767685
```

---

The row names of the data file are identifiers corresponding to the polygons - we want to make them a separate variable:

```r
nat_data$id <- row.names(nat_data)
```

---
# Extracting the polygon information

The `fortify` function in the `ggplot2` package extracts the polygons into a data frame.
```r
nat_map <- ggplot2::fortify(sFsmall)
head(nat_map)
#>      long     lat order  hole piece id group
#> 1 2525287 2407463     1 FALSE     1  0   0.1
#> 2 2525840 2407413     2 FALSE     1  0   0.1
#> 3 2527187 2406561     3 FALSE     1  0   0.1
#> 4 2527167 2406442     4 FALSE     1  0   0.1
#> 5 2527987 2406306     5 FALSE     1  0   0.1
#> 6 2527967 2406186     6 FALSE     1  0   0.1
```

---

We need to make sure that `group` and `piece` are kept as factor variables. If they are allowed to be converted to numeric values, information is lost: as factor levels, `9` and `9.0` are distinct, but as numbers they are identical.

```r
nat_map$group <- paste("g",nat_map$group,sep=".")
nat_map$piece <- paste("p",nat_map$piece,sep=".")
head(nat_map)
#>      long     lat order  hole piece id group
#> 1 2525287 2407463     1 FALSE   p.1  0 g.0.1
#> 2 2525840 2407413     2 FALSE   p.1  0 g.0.1
#> 3 2527187 2406561     3 FALSE   p.1  0 g.0.1
#> 4 2527167 2406442     4 FALSE   p.1  0 g.0.1
#> 5 2527987 2406306     5 FALSE   p.1  0 g.0.1
#> 6 2527967 2406186     6 FALSE   p.1  0 g.0.1
```

---
# Plot it

```r
ggplot(nat_map, aes(x=long, y=lat, group=group)) +
  geom_polygon(fill="white", colour="black")
```

<img src="week10.class2_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---
# Add measured variables

```r
ggplot(nat_data, aes(fill=AREA_SQKM)) +
  geom_map(aes(map_id=id), map=nat_map) +
  expand_limits(x=nat_map$long, y=nat_map$lat) +
  theme_map()
```

<img src="week10.class2_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---
# Handling missing values

- Need to know how the missings are coded: hopefully they are clearly marked and read into R as `NA`, not as `0`, `-9`, `-9999`, or `.`. Recode as needed.
- Study the distribution of missing vs not missing, which will help determine how to handle them.

---
# How can missings affect the analysis?

- If missings happen when conditions are special, e.g. a sensor tends to stop recording when the temperature drops below 3 degrees Celsius, estimates of the model parameters may not reflect the population parameters.
- Some techniques, particularly multivariate methods like many used in data mining, require complete records across many variables. Just a few missing values can mean that a lot of cases cannot be used.

---
# Making it easy

- MissingDataGUI
- Methods for summarising missings in a data set
- Ways to plot to examine dependence between missing vs not missing
- Imputation methods to substitute missings

```r
library(naniar)
vis_miss(tao, sort_miss=TRUE) + theme(aspect.ratio=1)
```

<img src="week10.class2_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---
# Numerical summaries

Proportion of observations missing:

```
#> [1] 0.02672101
```

Proportion of variables missing:

```
#> [1] 0.3333333
```

How many observations have `\(k\)` missings?

```
#> [[1]]
#> # A tibble: 4 x 3
#>   n_missing_in_case n_cases    percent
#>               <int>   <int>      <dbl>
#> 1                 0     565 76.7663043
#> 2                 1     167 22.6902174
#> 3                 2       2  0.2717391
#> 4                 3       2  0.2717391
```

---
# By group

```
#> [[1]]
#> # A tibble: 6 x 4
#>    year n_missing_in_case n_cases    percent
#>   <int>             <int>   <int>      <dbl>
#> 1  1997                 0     291 79.0760870
#> 2  1997                 1      77 20.9239130
#> 3  1993                 0     274 74.4565217
#> 4  1993                 1      90 24.4565217
#> 5  1993                 2       2  0.5434783
#> 6  1993                 3       2  0.5434783
```

---
# Missings shouldn't be ignored, but most software will simply drop them!

```
#> Warning: Removed 94 rows containing missing values (geom_point).
```
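
The warning above comes from ggplot2 silently dropping incomplete rows. A minimal sketch of the kind of plot that produces it, assuming the `tao` ocean buoys data from the earlier `naniar` example and guessing that air temperature and humidity were the two variables plotted:

```r
library(ggplot2)
# The variable names below are a guess at which pair of tao variables was
# plotted; substitute the names used in your copy of the data
ggplot(tao, aes(x = air_temp_c, y = humidity)) +
  geom_point()
# ggplot2 drops the incomplete rows and emits the warning shown above
```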
<img src="week10.class2_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---
# Keep them in the plot

<img src="week10.class2_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

---
# By year

<img src="week10.class2_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---
# Understanding missing dependencies

<img src="week10.class2_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

Year needs to be accounted for in finding good substitute values.

---
# Relationship with other variables

<img src="week10.class2_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---
# Handling missings

- If a small fraction of cases have several missings, drop those cases.
- If a variable or two, out of many, have a lot of missings, drop those variables.
- If missings are small in number but spread across many cases and variables, you need to impute these values to do most analyses.
- Designing the imputation should take into account dependencies that you have seen between missingness and the existing variables.
- For the ocean buoys data this means imputation needs to be done separately by year.

---
# Common ways to impute values

- Simple parametric: use the mean or median of the complete cases for each variable
- Simple non-parametric: find the `\(k\)` nearest neighbours with a complete value and average these
- Multiple imputation: use a statistical distribution, e.g. a normal model, and simulate a value (or a set of values, as in hot deck imputation) for the missings

---
# Examples

Using the mean and ignoring year:

<img src="week10.class2_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

POOR MATCH!

---
# By year

<img src="week10.class2_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />

Better, but still a bit weird!

---
# Nearest neighbours imputation

<img src="week10.class2_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

A LITTLE BETTER!

---
# By year

<img src="week10.class2_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

MUCH BETTER!

---
# Famous example of ignoring missings

- The Space Shuttle Challenger broke apart 73 seconds after launch in January 1986, killing all seven crew members.
- Subsequent investigation determined that the cause was failure of the O-ring seals used to isolate the fuel supply from burning gases.
- In assessing the risk of O-ring failure before the launch, NASA staff ignored the observations (previous launches) where no O-rings failed.

![](challenger.png)
![](challenger_plot.gif)

[http://www.asktog.com/books/challengerExerpt.html](http://www.asktog.com/books/challengerExerpt.html)
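
---
# A sketch: dropping the zero-failure flights

A minimal sketch of the point above, using a small hypothetical O-ring data set (not the actual NASA records): once the launches with zero damaged O-rings are dropped, the relationship between temperature and damage is much harder to see.

```r
library(ggplot2)
library(dplyr)

# Hypothetical illustration only - these temperatures and damage counts are made up
orings <- tibble::tibble(
  temp_f  = c(53, 57, 63, 70, 70, 72, 75, 76, 79, 81),
  damaged = c( 2,  1,  1,  1,  0,  0,  0,  0,  0,  0)
)

# Pre-launch style analysis: keep only the flights with at least one failure;
# the temperature-damage relationship is much harder to see
orings %>%
  filter(damaged > 0) %>%
  ggplot(aes(x = temp_f, y = damaged)) +
  geom_point()

# Keeping all flights makes the low-temperature risk much clearer
ggplot(orings, aes(x = temp_f, y = damaged)) +
  geom_point()
```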
---
# Working with text

```r
tb <- read_csv("../data/tb.csv")
tb[7:10,1:10]
#> # A tibble: 4 x 10
#>    iso2  year  m_04 m_514 m_014 m_1524 m_2534 m_3544 m_4554 m_5564
#>   <chr> <int> <int> <int> <int>  <int>  <int>  <int>  <int>  <int>
#> 1    AD  1996    NA    NA     0      0      0      4      1      0
#> 2    AD  1997    NA    NA     0      0      1      2      2      1
#> 3    AD  1998    NA    NA     0      0      0      1      0      0
#> 4    AD  1999    NA    NA     0      0      0      1      1      0
```

---
# Convert to long form

```r
tb_long <- tb %>% gather(variable, count, m_04:f_u)
head(tb_long)
#> # A tibble: 6 x 4
#>    iso2  year variable count
#>   <chr> <int>    <chr> <int>
#> 1    AD  1989     m_04    NA
#> 2    AD  1990     m_04    NA
#> 3    AD  1991     m_04    NA
#> 4    AD  1992     m_04    NA
#> 5    AD  1993     m_04    NA
#> 6    AD  1994     m_04    NA
```

---
# String split

```r
tb_long <- tb_long %>% separate(variable, c("gender", "age"))
head(tb_long)
#> # A tibble: 6 x 5
#>    iso2  year gender   age count
#>   <chr> <int>  <chr> <chr> <int>
#> 1    AD  1989      m    04    NA
#> 2    AD  1990      m    04    NA
#> 3    AD  1991      m    04    NA
#> 4    AD  1992      m    04    NA
#> 5    AD  1993      m    04    NA
#> 6    AD  1994      m    04    NA
```

---
# Take a look

```r
tb_long %>%
  filter(iso2 == "CO", age!="u", year>1998) %>%
  ggplot(aes(x=year, y=count, colour=gender)) +
  geom_point() +
  facet_wrap(~age, ncol=3) +
  scale_color_brewer(palette="Dark2")
```

<img src="week10.class2_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />

---
# More complex text

Working with text more generally involves learning about *regular expressions*, a very terse, succinct language for searching and replacing in strings. They are hard to get up and running with, but enable efficient handling of text.

---
# Resources

- [eechidna](https://cran.r-project.org/web/packages/eechidna/index.html): Exploring Election and Census Highly Informative Data Nationally for Australia
- [AEC electorate polygons](http://www.aec.gov.au/Electorates/gis/gis_datadownload.htm)
- [Paper on exploring missingness](https://www.jstatsoft.org/article/view/v068i06/v68i06.pdf)
- [Slides](https://stat585-at-isu.github.io/materials/07_regular-expressions/01_regular_expressions.html) on regular expressions from STAT 585 at ISU

---
class: inverse middle

# Share and share alike

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.