Personal code snippets of @tmasjc

Site powered by Hugo + Blogdown

Image by Mads Schmidt Rasmussen from unsplash.com

Minimal Bootstrap Theme by Zachary Betz

Programming in dplyr - Column Manipulation

Sep 7, 2018 #dplyr #tidyeval

How do you program the column manipulation dynamically when working with dplyr verb?

Say, we have a dataframe with multiple columns in pair,

library(tidyverse)

# simulate some data
df <- rerun(3 * 2, rnorm(10, 5, 2)) %>% bind_cols()
# imagine some column names
colnames(df) <- map(c("weight", "height", "length"), paste0, c(".x", ".y")) %>% unlist()
df
## # A tibble: 10 x 6
##    weight.x weight.y height.x height.y length.x length.y
##       <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1  6.79        3.00     2.97    2.80      3.37     3.51
##  2  8.01        6.61     7.04    5.90      3.00     4.08
##  3  0.00663     3.87     2.36    4.73      2.32     5.13
##  4  5.97        4.94     3.21    4.04      2.46     5.54
##  5  0.731       5.32     5.36    0.582     3.08     3.55
##  6  5.37        8.28     7.23    4.00      6.13     3.93
##  7  5.02        2.56     3.25    3.76      6.14     6.46
##  8  5.70        1.22     3.59    3.78      5.13     4.43
##  9  5.58        2.95     8.78    5.91      4.22     3.37
## 10  3.43        7.28     3.36    4.78      4.16     4.90

and we try to calculate the difference between metrices .x and .y

Expression

Let formulate a form of “expression”,

calc_diff <- function(var) {
    # the desire expression is (var.x - var.y)
    sprintf("%s.x - %s.y", var, var)
}
# example
calc_diff("weight")
## [1] "weight.x - weight.y"

Rlang

So we get our desire expression, but how do we pass them to dplyr to understand? We can use rlang to transform text into R expressions.

calc_diff <- function(df, var) {
    # parse as R code
    expr <- sprintf("%s.x - %s.y", var, var) %>% 
        rlang::parse_expr()
    df %>% mutate(!!expr)
}

df %>% 
    # add new column
    calc_diff("weight") %>% 
    # focus on result 
    select(contains("weight"))
## # A tibble: 10 x 3
##    weight.x weight.y `weight.x - weight.y`
##       <dbl>    <dbl>                 <dbl>
##  1  6.79        3.00                  3.79
##  2  8.01        6.61                  1.40
##  3  0.00663     3.87                 -3.86
##  4  5.97        4.94                  1.03
##  5  0.731       5.32                 -4.59
##  6  5.37        8.28                 -2.91
##  7  5.02        2.56                  2.46
##  8  5.70        1.22                  4.47
##  9  5.58        2.95                  2.63
## 10  3.43        7.28                 -3.85

Column Name

Well, we need a way to add name to the new column.

calc_diff <- function(df, var) {
    # to transform a quosure to a string 
    new_col_name <- quo_name(paste0(var, ".diff"))
    expr <- sprintf("%s.x - %s.y", var, var) %>% 
        rlang::parse_expr()
    # so that we can use it in dplyr context
    df %>% mutate(!!new_col_name := !!expr)
}

df %>% 
    # it works!
    calc_diff("weight") %>% 
    select(contains("weight"))
## # A tibble: 10 x 3
##    weight.x weight.y weight.diff
##       <dbl>    <dbl>       <dbl>
##  1  6.79        3.00        3.79
##  2  8.01        6.61        1.40
##  3  0.00663     3.87       -3.86
##  4  5.97        4.94        1.03
##  5  0.731       5.32       -4.59
##  6  5.37        8.28       -2.91
##  7  5.02        2.56        2.46
##  8  5.70        1.22        4.47
##  9  5.58        2.95        2.63
## 10  3.43        7.28       -3.85

Many Columns

So now we know how to do with 1 variable, how do we apply it to many?

  1. Use map as per normal to loop and generate multiple expressions
  2. Use set_names to give expressions list a name
  3. Use !!! (3x ! not 2) to splice multiple arguments in mutate. It will assign new variable names using names given by list (in step 2).
calc_diff <- function(df, ...) {
    # first, an expressions list
    n_exprs <- 
        # loop over multiple inputs
        map(list(...), ~ sprintf("%s.x - %s.y", .x, .x) %>% 
                       rlang::parse_expr()) %>% 
        set_names(map(list(...), ~ paste0(.x, ".diff")))
    
    # splices above expressions
    df %>% mutate(!!!n_exprs)
}

# tada~~
df %>% 
    calc_diff("weight", "height", "length")
## # A tibble: 10 x 9
##    weight.x weight.y height.x height.y length.x length.y weight.diff height.diff
##       <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>       <dbl>       <dbl>
##  1  6.79        3.00     2.97    2.80      3.37     3.51        3.79       0.172
##  2  8.01        6.61     7.04    5.90      3.00     4.08        1.40       1.15 
##  3  0.00663     3.87     2.36    4.73      2.32     5.13       -3.86      -2.37 
##  4  5.97        4.94     3.21    4.04      2.46     5.54        1.03      -0.830
##  5  0.731       5.32     5.36    0.582     3.08     3.55       -4.59       4.77 
##  6  5.37        8.28     7.23    4.00      6.13     3.93       -2.91       3.24 
##  7  5.02        2.56     3.25    3.76      6.14     6.46        2.46      -0.505
##  8  5.70        1.22     3.59    3.78      5.13     4.43        4.47      -0.183
##  9  5.58        2.95     8.78    5.91      4.22     3.37        2.63       2.87 
## 10  3.43        7.28     3.36    4.78      4.16     4.90       -3.85      -1.42 
## # … with 1 more variable: length.diff <dbl>