Andrea Rau
April 1, 2016
Programmers + reproducibility:
Data manipulation
Graphics and data transformation:
Three ways to execute several function on an object: 1) intermediate steps, 2) nested functions, 3) pipes
Use intermediate steps
mat <- matrix(rnorm(100), nrow=25, ncol=4)
tmp2 <- rowMeans(mat)
tmp3 <- max(tmp2)
round(tmp3, 2)
[1] 1.6
Three ways to execute several function on an object: 1) intermediate steps, 2) nested functions, 3) pipes
Nested functions
round(max(rowMeans(mat)), 2)
[1] 1.6
Pipes
library(magrittr)
mat %>% rowMeans %>% max %>% round(.,2)
[1] 1.6
Pipes allow us to reduce the number of nested (), write code that reads like pseudocode, increase organization!
dplyr is a relatively new (2014) package using R/C++ that provides easy and fast tools for the most common data.frame manipulation tasks
Core principle is the “split-apply-combine” approach to data analysis:
Accepts a limited number of input objects: data.frame, data.table (from the data.table
package), SQL tables, data cubes (experimental)
dplyr is built around 5 main verbs:
data(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
library(dplyr)
## select(iris, starts_with("P"))
## select(iris, -Species)
## head(iris[,grep("Petal", colnames(iris))])
iris %>% select(contains("Petal")) %>% head
Petal.Length Petal.Width
1 1.4 0.2
2 1.4 0.2
3 1.3 0.2
4 1.5 0.2
5 1.4 0.2
6 1.7 0.4
## head(iris[which(iris$Species == "setosa" & iris$Petal.Length < 5),])
iris %>% filter(Species=="setosa", Petal.Length < 5) %>% head
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
## iris[1:3, ]
## head(iris[order(iris$Petal.Length),])
iris %>% slice(1:3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
iris %>% arrange(Petal.Length) %>% head
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.6 3.6 1.0 0.2 setosa
2 4.3 3.0 1.1 0.1 setosa
3 5.8 4.0 1.2 0.2 setosa
4 5.0 3.2 1.2 0.2 setosa
5 4.7 3.2 1.3 0.2 setosa
6 5.4 3.9 1.3 0.4 setosa
## head(data.frame(iris, Sepal.Length.mm = iris$Sepal.Length*10))
iris %>% mutate(Sepal.Length.mm = Sepal.Length*10) %>% head
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Sepal.Length.mm
1 51
2 49
3 47
4 46
5 50
6 54
## mean(iris$Sepal.Length, na.rm=TRUE)
iris %>% summarise(sepalMean=mean(Sepal.Length, na.rm=TRUE))
sepalMean
1 5.843333
## data.frame(aggregate(.~Species, data=iris, mean, simplify=TRUE),
## total=aggregate(Sepal.Length~Species, data=iris, length, simplify=TRUE)$Sepal.Length)
gb <- group_by(iris,Species) %>%
summarise(SL=mean(Sepal.Length), SW=mean(Sepal.Width),
PL=mean(Petal.Length), PW=mean(Petal.Width), total=n())
gb
Source: local data frame [3 x 6]
Species SL SW PL PW total
(fctr) (dbl) (dbl) (dbl) (dbl) (int)
1 setosa 5.006 3.428 1.462 0.246 50
2 versicolor 5.936 2.770 4.260 1.326 50
3 virginica 6.588 2.974 5.552 2.026 50
## dim(iris[sample(1:nrow(iris), 50)])
iris %>% sample_n(50) %>% dim
[1] 50 5
## dim(iris[sample(1:nrow(iris), 50, replace=TRUE)])
iris %>% sample_n(50, replace=TRUE) %>% dim
[1] 50 5
## Sample 10% of observations
iris %>% sample_frac(0.10) %>% dim
[1] 15 5
Two-table verbs refer to a merge/join operation:
(Courtesy of Jenny Bryan)
superheroes
Source: local data frame [7 x 4]
name alignment gender publisher
(chr) (chr) (chr) (chr)
1 Magneto bad male Marvel
2 Storm good female Marvel
3 Mystique bad female Marvel
4 Batman good male DC
5 Joker bad male DC
6 Catwoman bad female DC
7 Hellboy good male Dark Horse Comics
publishers
Source: local data frame [3 x 2]
publisher yr_founded
(chr) (int)
1 DC 1934
2 Marvel 1939
3 Image 1992
Note: we drop Hellboy because his publisher Dark Horse Comics is not in publishers
.
inner_join(superheroes, publishers)
Source: local data frame [6 x 5]
name alignment gender publisher yr_founded
(chr) (chr) (chr) (chr) (int)
1 Magneto bad male Marvel 1939
2 Storm good female Marvel 1939
3 Mystique bad female Marvel 1939
4 Batman good male DC 1934
5 Joker bad male DC 1934
6 Catwoman bad female DC 1934
Now we have all superheroes, but yr_founded for Hellboy is NA since he is not in publishers
.
left_join(superheroes, publishers)
Source: local data frame [7 x 5]
name alignment gender publisher yr_founded
(chr) (chr) (chr) (chr) (int)
1 Magneto bad male Marvel 1939
2 Storm good female Marvel 1939
3 Mystique bad female Marvel 1939
4 Batman good male DC 1934
5 Joker bad male DC 1934
6 Catwoman bad female DC 1934
7 Hellboy good male Dark Horse Comics NA
After swapping the order of publishers and superheroes, now we have all publishers but name/alignment/gender for Image is NA.
left_join(publishers, superheroes)
Source: local data frame [7 x 5]
publisher yr_founded name alignment gender
(chr) (int) (chr) (chr) (chr)
1 DC 1934 Batman good male
2 DC 1934 Joker bad male
3 DC 1934 Catwoman bad female
4 Marvel 1939 Magneto bad male
5 Marvel 1939 Storm good female
6 Marvel 1939 Mystique bad female
7 Image 1992 NA NA NA
tbl_df is a useful wrapper around large data frames as it avoids printing large amount of data to the screen
dplyr works with local in-memory data like data frames and data tables as well as remote on-disk data stored in databases if you have data that do not fit in memory –> supports SQLite, MySQL, PostgreSQL, and Google's bigquery (see https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html)
It is a good idea to load dplyr last among packages since some function names (like filter) overlap with base R functions… or use dplyr::filter
to be safe
Tidy data = standard method of data storage for improved interpretability + integration
Principles of tidy data:
tidyr is a package designed for data cleaning and uses two main functions: gather() and separate().
messy <- data.frame(site=c("a", "b", "c", "d", "e"), trt1=c(56,76,43,25,21), trt2=c(123,234,187,198,23))
messy
site trt1 trt2
1 a 56 123
2 b 76 234
3 c 43 187
4 d 25 198
5 e 21 23
gather() takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed
library(tidyr)
gm <- messy %>% gather(treatment, count, trt1:trt2)
gm
site treatment count
1 a trt1 56
2 b trt1 76
3 c trt1 43
4 d trt1 25
5 e trt1 21
6 a trt2 123
7 b trt2 234
8 c trt2 187
9 d trt2 198
10 e trt2 23
separate() takes a single column and separates it into several columns
df <- data.frame(x = c("a.b", "a.d", "b.c"))
df
x
1 a.b
2 a.d
3 b.c
df %>% separate(x, c("A", "B"))
A B
1 a b
2 a d
3 b c
spread() unstacks two variables (one a factor and the other a vector of values) into multiple columns whose names are the factor levels (opposite of gather())
gm %>% spread(key=treatment, value=count)
site trt1 trt2
1 a 56 123
2 b 76 234
3 c 43 187
4 d 25 198
5 e 21 23
“Whenever you’re learning a new tool, for a long time you’re going to suck… But the good news is that is typical, that’s something that happens to everyone, and it’s only temporary.” (Hadley Wickham, useR! 2014 tutorial)
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Tidy data is easy to inpute into the graphing capabilities of ggplot2 (see next eCafe)…