I recently watched the “Tidy eval: Programming with dplyr, tidyr, and ggplot2” video. It’s an excellent introduction to tidy evaluation, which is the core concept for programming with dplyr and friends.
In this video, Hadley showed the grouped_mean function on a slide (12:48). Implementing this function is a good exercise in tidy evaluation, and an excellent opportunity to compare this approach with the standard evaluation rules provided by the seplyr package.
Let’s start with the simple example:
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(mean = mean(hp))
## # A tibble: 3 x 2
## cyl mean
## <dbl> <dbl>
## 1 4 82.6
## 2 6 122.
## 3 8 209.
The code below shows the first version of this function (based on the knowledge from the video).
grouped_mean <- function(dt, group, value) {
  group <- enquo(group)
  value <- enquo(value)
  dt %>%
    group_by(!!group) %>%
    summarise(mean = mean(!!value))
}
Let’s try it:
grouped_mean(mtcars, cyl, hp)
## # A tibble: 3 x 2
## cyl mean
## <dbl> <dbl>
## 1 4 82.6
## 2 6 122.
## 3 8 209.
grouped_mean(mtcars, gear, mpg)
## # A tibble: 3 x 2
## gear mean
## <dbl> <dbl>
## 1 3 16.1
## 2 4 24.5
## 3 5 21.4
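As a side note, the enquo() and !! pair used above can also be written with the embrace operator {{ }}, which is available in rlang 0.4.0 and later. The sketch below is only an illustration: the name grouped_mean_embrace is hypothetical, and it assumes a recent enough rlang is installed alongside dplyr.

```r
library(dplyr)

# Hypothetical variant of grouped_mean: {{ x }} captures the argument
# and unquotes it in one step, replacing the enquo() + !! pair.
grouped_mean_embrace <- function(dt, group, value) {
  dt %>%
    group_by({{ group }}) %>%
    summarise(mean = mean({{ value }}))
}

grouped_mean_embrace(mtcars, cyl, hp)
```

The call is the same as for grouped_mean, with bare column names.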
But maybe we want to use more than one variable for grouping? This use case is described here in the section “Capturing multiple variables”. So the second version might look like this (I had to change the order of the arguments, so that the grouping variables can be collected by ... at the end):
grouped_mean2 <- function(dt, value, ...) {
  value <- enquo(value)
  groups <- quos(...)
  dt %>%
    group_by(!!!groups) %>%
    summarise(mean = mean(!!value))
}
grouped_mean2(mtcars, mpg) # without grouping
## # A tibble: 1 x 1
## mean
## <dbl>
## 1 20.1
grouped_mean2(mtcars, mpg, cyl) # one variable used for grouping
## # A tibble: 3 x 2
## cyl mean
## <dbl> <dbl>
## 1 4 26.7
## 2 6 19.7
## 3 8 15.1
grouped_mean2(mtcars, mpg, cyl, gear) # two variables
## # A tibble: 8 x 3
## # Groups: cyl [3]
## cyl gear mean
## <dbl> <dbl> <dbl>
## 1 4 3 21.5
## 2 4 4 26.9
## 3 4 5 28.2
## 4 6 3 19.8
## 5 6 4 19.8
## 6 6 5 19.7
## 7 8 3 15.0
## 8 8 5 15.4
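Because group_by() itself accepts ..., the dots can also be forwarded directly instead of being captured with quos() and spliced with !!!. The following is a minimal sketch of that variant; grouped_mean_dots is a hypothetical name used only for this illustration.

```r
library(dplyr)

# Hypothetical variant of grouped_mean2: the grouping variables in `...`
# are handed straight to group_by(), so no quos()/!!! step is needed.
grouped_mean_dots <- function(dt, value, ...) {
  value <- enquo(value)
  dt %>%
    group_by(...) %>%
    summarise(mean = mean(!!value))
}

grouped_mean_dots(mtcars, mpg, cyl, gear)
```

Capturing the dots explicitly, as in grouped_mean2, is still useful when the quosures need to be inspected or modified before use.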
seplyr
However, we might want to pass the column names as strings, in which case nonstandard evaluation gets in the way. But there’s the seplyr package, which provides another interface to dplyr in which you can pass a vector of strings. It works perfectly for grouping, but for other functions like summarise or mutate it’s not as elegant as the tidy solution.
library(seplyr)
grouped_mean_se <- function(dt, group, value) {
  # I pass the R code to summarise_se as a string
  # it's not very elegant:(
  dt %>%
    group_by_se(group) %>%
    summarise_se(setNames(sprintf("mean(`%s`)", value), "mean"))
}
grouped_mean_se(mtcars, "cyl", "hp")
## # A tibble: 3 x 2
## cyl mean
## <dbl> <dbl>
## 1 4 82.6
## 2 6 122.
## 3 8 209.
grouped_mean_se(mtcars, "gear", "mpg")
## # A tibble: 3 x 2
## gear mean
## <dbl> <dbl>
## 1 3 16.1
## 2 4 24.5
## 3 5 21.4
The good thing about this solution is that grouping by multiple columns works without any modifications. See the example below:
grouped_mean_se(mtcars, c("gear", "cyl"), "mpg")
## # A tibble: 8 x 3
## # Groups: gear [3]
## gear cyl mean
## <dbl> <dbl> <dbl>
## 1 3 4 21.5
## 2 3 6 19.8
## 3 3 8 15.0
## 4 4 4 26.9
## 5 4 6 19.8
## 6 5 4 28.2
## 7 5 6 19.7
## 8 5 8 15.4
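To see what grouped_mean_se actually hands to summarise_se, it may help to build the string on its own. This is a base R sketch that uses no seplyr functions; the variable name terms is only illustrative.

```r
value <- "hp"

# sprintf() pastes the column name into a code template; the backticks
# keep the result valid R even for non-syntactic column names.
# setNames() supplies the name of the output column.
terms <- setNames(sprintf("mean(`%s`)", value), "mean")
print(terms)

# The string parses to an ordinary R call.
print(parse(text = terms[["mean"]])[[1]])
```

The result is a named character vector: the name becomes the new column, and the value is the R code to evaluate.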
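You can use the seplyr approach with tidyeval to make it nicer. Note that rlang::parse_quosure works like enquo, but builds the quosure from the string stored in the variable instead of capturing the argument’s expression. As the warning in the output below shows, it is deprecated in favour of parse_quo.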
grouped_mean_se2 <- function(dt, group, value) {
  value <- rlang::parse_quosure(value)
  dt %>%
    group_by_se(group) %>%
    summarise(mean = mean(!!value))
}
grouped_mean_se2(mtcars, c("gear", "cyl"), "hp")
## Warning: `parse_quosure()` is deprecated as of rlang 0.2.0.
## Please use `parse_quo()` instead.
## This warning is displayed once per session.
## # A tibble: 8 x 3
## # Groups: gear [3]
## gear cyl mean
## <dbl> <dbl> <dbl>
## 1 3 4 97
## 2 3 6 108.
## 3 3 8 194.
## 4 4 4 76
## 5 4 6 116.
## 6 5 4 102
## 7 5 6 175
## 8 5 8 300.
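For comparison, string arguments can also be handled without seplyr and without the deprecated parse_quosure. The sketch below is one possible variant, not the approach used in this post: it converts the grouping strings to symbols with rlang::syms() and looks up the value column through the .data pronoun. The name grouped_mean_str is hypothetical.

```r
library(dplyr)

# Hypothetical string-based variant:
# - rlang::syms() turns a character vector into a list of symbols,
#   which !!! splices into group_by()
# - .data[[value]] looks the column up by name inside summarise()
grouped_mean_str <- function(dt, group, value) {
  dt %>%
    group_by(!!!rlang::syms(group)) %>%
    summarise(mean = mean(.data[[value]]))
}

grouped_mean_str(mtcars, c("gear", "cyl"), "hp")
```

Unlike the parse-based version, this only accepts column names, not arbitrary expressions, in value.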
There are also other possibilities for combining the tidyeval approach with seplyr. One that seems useful is to pass the grouping variables as a string vector, but use dplyr’s standard rules in summarise.
grouped_summarise <- function(dt, group, ...) {
  dt %>%
    group_by_se(group) %>%
    summarise(...)
}
grouped_summarise(
mtcars, "gear",
mean_hp = mean(hp),
mean_mpg = mean(mpg)
)
## # A tibble: 3 x 3
## gear mean_hp mean_mpg
## <dbl> <dbl> <dbl>
## 1 3 176. 16.1
## 2 4 89.5 24.5
## 3 5 196. 21.4
grouped_summarise(
mtcars, c("gear", "cyl"),
mean_hp = mean(hp),
mean_mpg = mean(mpg),
n = n()
)
## # A tibble: 8 x 5
## # Groups: gear [3]
## gear cyl mean_hp mean_mpg n
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 3 4 97 21.5 1
## 2 3 6 108. 19.8 2
## 3 3 8 194. 15.0 12
## 4 4 4 76 26.9 8
## 5 4 6 116. 19.8 4
## 6 5 4 102 28.2 2
## 7 5 6 175 19.7 1
## 8 5 8 300. 15.4 2
The same function written using only standard evaluation techniques is a bit less elegant, because the user needs to pass the summarise expressions as strings. This might be a problem, because syntax highlighting and code analysis tools do not work inside strings. But this approach might sometimes be useful.
grouped_summarise_se <- function(dt, group, vals) {
  dt %>%
    group_by_se(group) %>%
    summarise_se(summarizeTerms = vals)
}
grouped_summarise_se(
mtcars, "gear",
vals = list(
mean_hp = "mean(hp)",
mean_mpg = "mean(mpg)")
)
## # A tibble: 3 x 3
## gear mean_hp mean_mpg
## <dbl> <dbl> <dbl>
## 1 3 176. 16.1
## 2 4 89.5 24.5
## 3 5 196. 21.4
grouped_summarise_se(
mtcars, c("gear", "cyl"),
vals = list(
mean_hp = "mean(hp)",
mean_mpg = "mean(mpg)",
n = "n()"
)
)
## # A tibble: 8 x 5
## # Groups: gear [3]
## gear cyl mean_hp mean_mpg n
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 3 4 97 21.5 1
## 2 3 6 108. 19.8 2
## 3 3 8 194. 15.0 12
## 4 4 4 76 26.9 8
## 5 4 6 116. 19.8 4
## 6 5 4 102 28.2 2
## 7 5 6 175 19.7 1
## 8 5 8 300. 15.4 2
wrapr
The last topic related to nonstandard evaluation rules is the wrapr package. It allows substituting a variable name in a code block with something else. Consider this simple example: the variable VALUE will be replaced by xxx. I set the eval parameter to FALSE to capture the expression without evaluating it. For more information please check the articles here or here.
value <- "xxx"
wrapr::let(
  c(VALUE = value), eval = FALSE,
  dt %>%
    group_by_se(group) %>%
    summarise(mean = mean(VALUE))
)
## dt %>% group_by_se(group) %>% summarise(mean = mean(xxx))
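The idea of replacing a placeholder name inside an unevaluated expression can be illustrated with base R’s substitute(). This is only an analogy for what the example above does, and it makes no claim about how wrapr::let is implemented internally.

```r
value <- "xxx"

# substitute() replaces the symbol VALUE with the symbol named by `value`
# and returns the resulting expression without evaluating it.
expr <- substitute(mean(VALUE), list(VALUE = as.name(value)))
print(expr)
```

The result is an unevaluated call in which VALUE has been replaced by xxx.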
So the final version of grouped_mean using wrapr::let might look like this (and for me, it’s the most elegant solution if we want to use standard evaluation rules and pass string arguments):
grouped_mean_wrapr <- function(dt, group, value) {
  wrapr::let(
    c(VALUE = value),
    dt %>%
      group_by_se(group) %>%
      summarise(mean = mean(VALUE))
  )
}
grouped_mean_wrapr(mtcars, c("cyl", "gear"), "hp")
## # A tibble: 8 x 3
## # Groups: cyl [3]
## cyl gear mean
## <dbl> <dbl> <dbl>
## 1 4 3 97
## 2 4 4 76
## 3 4 5 102
## 4 6 3 108.
## 5 6 4 116.
## 6 6 5 175
## 7 8 3 194.
## 8 8 5 300.
codetools::checkUsage(grouped_mean_wrapr, all = TRUE)
## <anonymous>: no visible binding for global variable 'VALUE' (<text>:2-7)
But there’s one caveat. Automatic code checking tools (like codetools::checkUsage) might treat VALUE as an undefined variable. It might cause a warning in R CMD check, so such code would have a problem getting onto CRAN. The easy fix for this is to use the name value instead of VALUE inside let. However, I think that using uppercase variable names is a better solution, because they’re more visible and make it easier to see which variables are going to be substituted inside the code block. So the other solution is to create an empty variable VALUE to turn off this warning.
grouped_mean_wrapr_clean <- function(dt, group, value) {
  VALUE <- NULL
  wrapr::let(
    c(VALUE = value),
    dt %>%
      group_by_se(group) %>%
      summarise(mean = mean(VALUE))
  )
}
codetools::checkUsage(grouped_mean_wrapr_clean, all = TRUE)
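Another way that package authors often deal with this kind of note is to declare the placeholder name with utils::globalVariables(). The sketch below assumes the call is placed at the top level of a file in the package’s R/ directory; I haven’t verified how it interacts with a direct codetools::checkUsage call outside of a package.

```r
# Assumed to live at the top level of a file in a package's R/ directory.
# It registers VALUE as a known global name, so that R CMD check does not
# report "no visible binding for global variable 'VALUE'".
utils::globalVariables("VALUE")
```

Compared with VALUE <- NULL, this keeps the function body free of dummy assignments, but it applies to the whole package rather than to a single function.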
Summary
In this post I tried to show how you can program with dplyr, which is based on the tidyeval principle, and with two other approaches: seplyr, which is mostly dplyr with standard evaluation rules, and wrapr::let, which uses substitution to get the expected code. Of these three approaches, my gut feeling tells me that wrapr::let is the most elegant and precise, but I can’t tell if it is sufficient. Probably all three approaches have their use cases.
Session info
R.version
## _
## platform x86_64-pc-linux-gnu
## arch x86_64
## os linux-gnu
## system x86_64, linux-gnu
## status
## major 3
## minor 6.1
## year 2019
## month 07
## day 05
## svn rev 76782
## language R
## version.string R version 3.6.1 (2019-07-05)
## nickname Action of the Toes
packageVersion("dplyr")
## [1] '0.8.3'
packageVersion("seplyr")
## [1] '0.8.4'
packageVersion("wrapr")
## [1] '1.9.3'