Personal code snippets of @tmasjc

Rowwise Operation

Mar 24, 2019 #plyr #data.table

This post compares three ways to perform a row-wise operation on a data frame.

In this case, we want to find, for each row, the name of the column that holds the highest value.
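For a single row, the idea can be sketched in base R as follows. The vector `one_row` is a made-up example introduced here for illustration, not part of the data used in the rest of the post.

```r
# Hypothetical single row with three named values
one_row <- c(x = 3.1, y = 1.3, z = 5.1)

# which.max() returns the position of the first maximum;
# indexing names() with that position gives the column name
top_col <- names(one_row)[which.max(one_row)]
print(top_col)
## [1] "z"
```

The methods below differ only in how they repeat this lookup across every row.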

library(tidyverse)
set.seed(1212)

# a dummy data frame; naming the matrix columns up front lets as_tibble() keep them
dummy_df <- matrix(runif(15, min = 1, max = 9), ncol = 3,
                   dimnames = list(NULL, c("x", "y", "z"))) %>% as_tibble()

Method 1: dplyr::rowwise

use_rowwise <- function(df) {
    df %>% 
        # treat each row as its own group, so x, y, z are single values
        rowwise() %>% 
        mutate(max = names(df)[which.max(c(x, y, z))])
}
use_rowwise(dummy_df)
## Source: local data frame [5 x 4]
## Groups: <by row>
## 
## # A tibble: 5 x 4
##       x     y     z max  
##   <dbl> <dbl> <dbl> <chr>
## 1  3.12  1.27  5.12 z    
## 2  1.87  1.70  7.48 z    
## 3  8.74  1.47  1.22 x    
## 4  3.81  6.63  5.96 y    
## 5  6.08  2.94  6.28 z

Method 2: base::apply

use_apply <- function(df) {
    df %>% 
        # apply() coerces df to a matrix and calls which.max() once per row
        mutate(max = names(df)[apply(df, 1, which.max)])
}
use_apply(dummy_df)
## # A tibble: 5 x 4
##       x     y     z max  
##   <dbl> <dbl> <dbl> <chr>
## 1  3.12  1.27  5.12 z    
## 2  1.87  1.70  7.48 z    
## 3  8.74  1.47  1.22 x    
## 4  3.81  6.63  5.96 y    
## 5  6.08  2.94  6.28 z

Method 3: base::max.col with data.table

library(data.table)
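use_datatable <- function(df) {
    dt <- as.data.table(df)
    # max.col() gives the index of the largest column in each row of .SD;
    # := adds the new column by reference and returns the result invisibly
    dt[, max := names(.SD)[max.col(.SD)], .SDcols = 1:3]
}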
use_datatable(dummy_df) %>% print()
##           x        y        z max
## 1: 3.117172 1.265315 5.118694   z
## 2: 1.868388 1.695199 7.484665   z
## 3: 8.735410 1.474048 1.217860   x
## 4: 3.808189 6.631653 5.958628   y
## 5: 6.075657 2.938117 6.275357   z
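Note that `max.col()` ships with base R rather than data.table, so the same lookup also works on a plain data frame. Below is a minimal sketch of that variant; `use_maxcol` and `toy_df` are hypothetical names introduced here for illustration. By default `max.col()` breaks ties at random, so `ties.method = "first"` is passed to mirror what `which.max()` does.

```r
# Hypothetical helper: the max.col() approach without data.table
use_maxcol <- function(df) {
    # the right-hand side is evaluated before the new column exists,
    # so names(df) and as.matrix(df) only see the original columns
    df$max <- names(df)[max.col(as.matrix(df), ties.method = "first")]
    df
}

# Made-up data, small enough to check by eye
toy_df <- data.frame(x = c(3, 1, 8), y = c(1, 2, 1), z = c(5, 7, 1))
result <- use_maxcol(toy_df)
print(result$max)
## [1] "z" "z" "x"
```

This sketch assumes every column is numeric; a mixed data frame would need its numeric columns selected first.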

Efficiency

Let's benchmark the three methods on a larger data frame.
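Before timing anything, it may be worth confirming that the approaches agree. The following is a minimal base R sketch of such a check on a small random matrix; `check_m`, `by_apply` and `by_maxcol` are hypothetical names used only here.

```r
set.seed(1)

# Small random matrix with named columns (illustrative data only)
check_m <- matrix(runif(30), ncol = 3,
                  dimnames = list(NULL, c("x", "y", "z")))

# Row-wise lookup via apply() and which.max()
by_apply <- colnames(check_m)[apply(check_m, 1, which.max)]

# Same lookup via max.col(); "first" matches which.max() on ties
by_maxcol <- colnames(check_m)[max.col(check_m, ties.method = "first")]

# Stops with an error if the two results differ
stopifnot(identical(by_apply, by_maxcol))
```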

library(microbenchmark)

# for benchmarking: 1e6 rows x 3 columns, named like dummy_df
large_df <- matrix(runif(30e5), ncol = 3,
                   dimnames = list(NULL, names(dummy_df))) %>% as_tibble()
dim(large_df)
## [1] 1000000       3
microbenchmark(
    use_rowwise(large_df),
    use_apply(large_df),
    use_datatable(large_df),
    times = 30
)
## Unit: milliseconds
##                     expr         min          lq       mean      median
##    use_rowwise(large_df) 10642.03850 10828.97933 12185.4581 11591.86977
##      use_apply(large_df)  4482.81001  4802.02439  5825.6439  5412.52181
##  use_datatable(large_df)    43.81339    45.97957    82.9575    56.46214
##          uq        max neval
##  13270.8466 14965.8307    30
##   6582.7706  9428.1450    30
##     63.9674   332.6165    30

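Oh yeah, data.table is blazingly fast: by median time, the data.table method is roughly 100 times faster than apply and 200 times faster than rowwise.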