Programming in dplyr - Column Manipulation
How do you program the column manipulation dynamically when working with dplyr
verb?
Say, we have a dataframe with multiple columns in pair,
library(tidyverse)
# simulate some data
df <- rerun(3 * 2, rnorm(10, 5, 2)) %>% bind_cols()
# imagine some column names
colnames(df) <- map(c("weight", "height", "length"), paste0, c(".x", ".y")) %>% unlist()
df
## # A tibble: 10 x 6
## weight.x weight.y height.x height.y length.x length.y
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6.79 3.00 2.97 2.80 3.37 3.51
## 2 8.01 6.61 7.04 5.90 3.00 4.08
## 3 0.00663 3.87 2.36 4.73 2.32 5.13
## 4 5.97 4.94 3.21 4.04 2.46 5.54
## 5 0.731 5.32 5.36 0.582 3.08 3.55
## 6 5.37 8.28 7.23 4.00 6.13 3.93
## 7 5.02 2.56 3.25 3.76 6.14 6.46
## 8 5.70 1.22 3.59 3.78 5.13 4.43
## 9 5.58 2.95 8.78 5.91 4.22 3.37
## 10 3.43 7.28 3.36 4.78 4.16 4.90
and we try to calculate the difference between metrices .x and .y
Expression
Let formulate a form of “expression”,
calc_diff <- function(var) {
# the desire expression is (var.x - var.y)
sprintf("%s.x - %s.y", var, var)
}
# example
calc_diff("weight")
## [1] "weight.x - weight.y"
Rlang
So we get our desire expression, but how do we pass them to dplyr
to understand? We can use rlang
to transform text into R expressions.
calc_diff <- function(df, var) {
# parse as R code
expr <- sprintf("%s.x - %s.y", var, var) %>%
rlang::parse_expr()
df %>% mutate(!!expr)
}
df %>%
# add new column
calc_diff("weight") %>%
# focus on result
select(contains("weight"))
## # A tibble: 10 x 3
## weight.x weight.y `weight.x - weight.y`
## <dbl> <dbl> <dbl>
## 1 6.79 3.00 3.79
## 2 8.01 6.61 1.40
## 3 0.00663 3.87 -3.86
## 4 5.97 4.94 1.03
## 5 0.731 5.32 -4.59
## 6 5.37 8.28 -2.91
## 7 5.02 2.56 2.46
## 8 5.70 1.22 4.47
## 9 5.58 2.95 2.63
## 10 3.43 7.28 -3.85
Column Name
Well, we need a way to add name to the new column.
calc_diff <- function(df, var) {
# to transform a quosure to a string
new_col_name <- quo_name(paste0(var, ".diff"))
expr <- sprintf("%s.x - %s.y", var, var) %>%
rlang::parse_expr()
# so that we can use it in dplyr context
df %>% mutate(!!new_col_name := !!expr)
}
df %>%
# it works!
calc_diff("weight") %>%
select(contains("weight"))
## # A tibble: 10 x 3
## weight.x weight.y weight.diff
## <dbl> <dbl> <dbl>
## 1 6.79 3.00 3.79
## 2 8.01 6.61 1.40
## 3 0.00663 3.87 -3.86
## 4 5.97 4.94 1.03
## 5 0.731 5.32 -4.59
## 6 5.37 8.28 -2.91
## 7 5.02 2.56 2.46
## 8 5.70 1.22 4.47
## 9 5.58 2.95 2.63
## 10 3.43 7.28 -3.85
Many Columns
So now we know how to do with 1 variable, how do we apply it to many?
- Use
map
as per normal to loop and generate multiple expressions - Use
set_names
to give expressions list a name - Use
!!!
(3x ! not 2) to splice multiple arguments inmutate
. It will assign new variable names using names given by list (in step 2).
calc_diff <- function(df, ...) {
# first, an expressions list
n_exprs <-
# loop over multiple inputs
map(list(...), ~ sprintf("%s.x - %s.y", .x, .x) %>%
rlang::parse_expr()) %>%
set_names(map(list(...), ~ paste0(.x, ".diff")))
# splices above expressions
df %>% mutate(!!!n_exprs)
}
# tada~~
df %>%
calc_diff("weight", "height", "length")
## # A tibble: 10 x 9
## weight.x weight.y height.x height.y length.x length.y weight.diff height.diff
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6.79 3.00 2.97 2.80 3.37 3.51 3.79 0.172
## 2 8.01 6.61 7.04 5.90 3.00 4.08 1.40 1.15
## 3 0.00663 3.87 2.36 4.73 2.32 5.13 -3.86 -2.37
## 4 5.97 4.94 3.21 4.04 2.46 5.54 1.03 -0.830
## 5 0.731 5.32 5.36 0.582 3.08 3.55 -4.59 4.77
## 6 5.37 8.28 7.23 4.00 6.13 3.93 -2.91 3.24
## 7 5.02 2.56 3.25 3.76 6.14 6.46 2.46 -0.505
## 8 5.70 1.22 3.59 3.78 5.13 4.43 4.47 -0.183
## 9 5.58 2.95 8.78 5.91 4.22 3.37 2.63 2.87
## 10 3.43 7.28 3.36 4.78 4.16 4.90 -3.85 -1.42
## # … with 1 more variable: length.diff <dbl>