magrittr + dplyr + tidyr: Data manipulation tools from the Hadleyverse

Andrea Rau
April 1, 2016

alt text alt text alt text

The Hadleyverse

Programmers + reproducibility:

  • testthat, devtools, magrittr (with Stefan Bache)
  • rmarkdown, knitr

Data manipulation

  • dplyr, tidyr, broom, …

Graphics and data transformation:

  • ggplot2, ggmap, classifly, clusterfly, …

alt text

Pipe basics: magrittr

Three ways to execute several function on an object: 1) intermediate steps, 2) nested functions, 3) pipes

Use intermediate steps

mat <- matrix(rnorm(100), nrow=25, ncol=4)
tmp2 <- rowMeans(mat)
tmp3 <- max(tmp2)
round(tmp3, 2)
[1] 1.6

Pipe basics: magrittr

Three ways to execute several function on an object: 1) intermediate steps, 2) nested functions, 3) pipes

Nested functions

round(max(rowMeans(mat)), 2)
[1] 1.6

Pipes

library(magrittr)
mat %>% rowMeans %>% max %>% round(.,2)
[1] 1.6

Pipes allow us to reduce the number of nested (), write code that reads like pseudocode, increase organization!

dplyr basics

dplyr is a relatively new (2014) package using R/C++ that provides easy and fast tools for the most common data.frame manipulation tasks

Core principle is the “split-apply-combine” approach to data analysis:

  • Split the data into subsets designated by group membership
  • Apply a function to each data split
  • Combine the results into a new data object

Accepts a limited number of input objects: data.frame, data.table (from the data.table package), SQL tables, data cubes (experimental)

dplyr basics: one-table manipulation

dplyr is built around 5 main verbs:

  1. select() certain columns of data using additions like contains(), starts_with() and, ends_with()
  2. filter() your data to select specific rows (and slice())
  3. arrange() the rows of your data into an order
  4. mutate() your data frame to contain new columns
  5. summarise() chunks of your data in some way
  • group_by(), sample_n(), sample_frac() …

Example: Fisher's famous iris data

data(iris)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

dplyr example on iris data

library(dplyr)
## select(iris, starts_with("P")) 
## select(iris, -Species)
## head(iris[,grep("Petal", colnames(iris))])
iris %>% select(contains("Petal")) %>% head
  Petal.Length Petal.Width
1          1.4         0.2
2          1.4         0.2
3          1.3         0.2
4          1.5         0.2
5          1.4         0.2
6          1.7         0.4

dplyr example on iris data

## head(iris[which(iris$Species == "setosa" & iris$Petal.Length < 5),])
iris %>% filter(Species=="setosa", Petal.Length < 5) %>% head
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

dplyr example on iris data

## iris[1:3, ]
## head(iris[order(iris$Petal.Length),])
iris %>% slice(1:3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
iris %>% arrange(Petal.Length) %>% head
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          4.6         3.6          1.0         0.2  setosa
2          4.3         3.0          1.1         0.1  setosa
3          5.8         4.0          1.2         0.2  setosa
4          5.0         3.2          1.2         0.2  setosa
5          4.7         3.2          1.3         0.2  setosa
6          5.4         3.9          1.3         0.4  setosa

dplyr example on iris data

## head(data.frame(iris, Sepal.Length.mm = iris$Sepal.Length*10))
iris %>% mutate(Sepal.Length.mm = Sepal.Length*10) %>% head
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
  Sepal.Length.mm
1              51
2              49
3              47
4              46
5              50
6              54

dplyr: one-table manipulation summary

alt text

## mean(iris$Sepal.Length, na.rm=TRUE)
iris %>% summarise(sepalMean=mean(Sepal.Length, na.rm=TRUE))
  sepalMean
1  5.843333

dplyr example on iris data

alt text

## data.frame(aggregate(.~Species, data=iris, mean, simplify=TRUE), 
##       total=aggregate(Sepal.Length~Species, data=iris, length, simplify=TRUE)$Sepal.Length)
gb <- group_by(iris,Species) %>%  
  summarise(SL=mean(Sepal.Length), SW=mean(Sepal.Width), 
            PL=mean(Petal.Length), PW=mean(Petal.Width), total=n())
gb
Source: local data frame [3 x 6]

     Species    SL    SW    PL    PW total
      (fctr) (dbl) (dbl) (dbl) (dbl) (int)
1     setosa 5.006 3.428 1.462 0.246    50
2 versicolor 5.936 2.770 4.260 1.326    50
3  virginica 6.588 2.974 5.552 2.026    50

dplyr example on iris data

## dim(iris[sample(1:nrow(iris), 50)])
iris %>% sample_n(50) %>% dim
[1] 50  5
## dim(iris[sample(1:nrow(iris), 50, replace=TRUE)])
iris %>% sample_n(50, replace=TRUE) %>% dim
[1] 50  5
## Sample 10% of observations
iris %>% sample_frac(0.10) %>% dim
[1] 15  5

dplyr basics: two-table manipulation

Two-table verbs refer to a merge/join operation:

  1. inner_join(A,B,…): return rows common to A and B
  2. left_join(A,B,…): all of A and matching rows of B
  3. semi_join(A,B,…): include all rows of A that match B
  4. anti_join(A,B,…)

alt text

dplyr join example: superhero data

(Courtesy of Jenny Bryan)

superheroes
Source: local data frame [7 x 4]

      name alignment gender         publisher
     (chr)     (chr)  (chr)             (chr)
1  Magneto       bad   male            Marvel
2    Storm      good female            Marvel
3 Mystique       bad female            Marvel
4   Batman      good   male                DC
5    Joker       bad   male                DC
6 Catwoman       bad female                DC
7  Hellboy      good   male Dark Horse Comics
publishers
Source: local data frame [3 x 2]

  publisher yr_founded
      (chr)      (int)
1        DC       1934
2    Marvel       1939
3     Image       1992

dplyr join example: inner join

Note: we drop Hellboy because his publisher Dark Horse Comics is not in publishers.

inner_join(superheroes, publishers)
Source: local data frame [6 x 5]

      name alignment gender publisher yr_founded
     (chr)     (chr)  (chr)     (chr)      (int)
1  Magneto       bad   male    Marvel       1939
2    Storm      good female    Marvel       1939
3 Mystique       bad female    Marvel       1939
4   Batman      good   male        DC       1934
5    Joker       bad   male        DC       1934
6 Catwoman       bad female        DC       1934

dplyr join example: left join

Now we have all superheroes, but yr_founded for Hellboy is NA since he is not in publishers.

left_join(superheroes, publishers)
Source: local data frame [7 x 5]

      name alignment gender         publisher yr_founded
     (chr)     (chr)  (chr)             (chr)      (int)
1  Magneto       bad   male            Marvel       1939
2    Storm      good female            Marvel       1939
3 Mystique       bad female            Marvel       1939
4   Batman      good   male                DC       1934
5    Joker       bad   male                DC       1934
6 Catwoman       bad female                DC       1934
7  Hellboy      good   male Dark Horse Comics         NA

dplyr join example: left join

After swapping the order of publishers and superheroes, now we have all publishers but name/alignment/gender for Image is NA.

left_join(publishers, superheroes)
Source: local data frame [7 x 5]

  publisher yr_founded     name alignment gender
      (chr)      (int)    (chr)     (chr)  (chr)
1        DC       1934   Batman      good   male
2        DC       1934    Joker       bad   male
3        DC       1934 Catwoman       bad female
4    Marvel       1939  Magneto       bad   male
5    Marvel       1939    Storm      good female
6    Marvel       1939 Mystique       bad female
7     Image       1992       NA        NA     NA

dplyr in practice

  • tbl_df is a useful wrapper around large data frames as it avoids printing large amount of data to the screen

  • dplyr works with local in-memory data like data frames and data tables as well as remote on-disk data stored in databases if you have data that do not fit in memory –> supports SQLite, MySQL, PostgreSQL, and Google's bigquery (see https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html)

  • It is a good idea to load dplyr last among packages since some function names (like filter) overlap with base R functions… or use dplyr::filter to be safe

tidyr basics

Tidy data = standard method of data storage for improved interpretability + integration

Principles of tidy data:

  1. Rows are observations
  2. Variables are columns
  3. Observational units are tables

tidyr is a package designed for data cleaning and uses two main functions: gather() and separate().

alt text

Messy data example

messy <- data.frame(site=c("a", "b", "c", "d", "e"), trt1=c(56,76,43,25,21), trt2=c(123,234,187,198,23))
messy
  site trt1 trt2
1    a   56  123
2    b   76  234
3    c   43  187
4    d   25  198
5    e   21   23

alt text

tidyr basics: gather

gather() takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed

library(tidyr)
gm <- messy %>% gather(treatment, count, trt1:trt2)
gm
   site treatment count
1     a      trt1    56
2     b      trt1    76
3     c      trt1    43
4     d      trt1    25
5     e      trt1    21
6     a      trt2   123
7     b      trt2   234
8     c      trt2   187
9     d      trt2   198
10    e      trt2    23

tidyr basics: separate

separate() takes a single column and separates it into several columns

df <- data.frame(x = c("a.b", "a.d", "b.c"))
df
    x
1 a.b
2 a.d
3 b.c
df %>% separate(x, c("A", "B")) 
  A B
1 a b
2 a d
3 b c

tidyr basics: spread

spread() unstacks two variables (one a factor and the other a vector of values) into multiple columns whose names are the factor levels (opposite of gather())

gm %>% spread(key=treatment, value=count)
  site trt1 trt2
1    a   56  123
2    b   76  234
3    c   43  187
4    d   25  198
5    e   21   23

Wrapping up with dplyr and tidyr

“Whenever you’re learning a new tool, for a long time you’re going to suck… But the good news is that is typical, that’s something that happens to everyone, and it’s only temporary.” (Hadley Wickham, useR! 2014 tutorial)