Rowwise Operation
Mar 24, 2019 #plyr #data.table
This post compares several ways to perform a row-wise operation on a data frame.
The task here: for each row, find the name of the column that holds the highest value.
library(tidyverse)
set.seed(1212)
# a dummy data frame
# name the columns up front so that as_tibble() (the replacement for the
# deprecated as_data_frame()) receives valid column names
dummy_df <- matrix(runif(15, min = 1, max = 9), ncol = 3,
                   dimnames = list(NULL, c("x", "y", "z"))) %>%
  as_tibble()
Method 1: dplyr::rowwise
use_rowwise <- function(df) {
  df %>%
    rowwise() %>%
    mutate(max = names(df)[which.max(c(x, y, z))])
}
use_rowwise(dummy_df)
## Source: local data frame [5 x 4]
## Groups: <by row>
##
## # A tibble: 5 x 4
##       x     y     z max
##   <dbl> <dbl> <dbl> <chr>
## 1  3.12  1.27  5.12 z
## 2  1.87  1.70  7.48 z
## 3  8.74  1.47  1.22 x
## 4  3.81  6.63  5.96 y
## 5  6.08  2.94  6.28 z
Method 2: base::apply
use_apply <- function(df) {
  df %>%
    mutate(max = names(df)[apply(df, 1, which.max)])
}
use_apply(dummy_df)
## # A tibble: 5 x 4
##       x     y     z max
##   <dbl> <dbl> <dbl> <chr>
## 1  3.12  1.27  5.12 z
## 2  1.87  1.70  7.48 z
## 3  8.74  1.47  1.22 x
## 4  3.81  6.63  5.96 y
## 5  6.08  2.94  6.28 z
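One caveat with this method: apply() converts a data frame to a matrix before looping over the rows, so a single non-numeric column turns every value into a string. The sketch below illustrates this with a made-up data frame (the `id` column and all values are assumptions for illustration, not part of the example above) and shows one possible way to restrict the operation to numeric columns.

```r
# Hypothetical data frame with one character column (illustration only).
df <- data.frame(id = c("a", "b"), x = c(3, 10), y = c(20, 7))

# apply() calls as.matrix() on a data frame; with a character column present,
# the resulting matrix is character, not numeric.
m_all <- as.matrix(df)
all_is_character <- is.character(m_all)

# Restricting to the numeric columns keeps the matrix numeric.
num_cols <- names(df)[sapply(df, is.numeric)]
m_num <- as.matrix(df[num_cols])

# Same idea as use_apply(): index of the row maximum, mapped to a column name.
max_name <- num_cols[apply(m_num, 1, which.max)]
print(max_name)
```

In the post's example every column is numeric, so this issue does not arise there.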
Method 3: base::max.col inside data.table
library(data.table)
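use_datatable <- function(df) {
  # as.data.table() makes a copy, so := below does not modify the input df
  dt <- as.data.table(df)
  # max.col() returns, for each row, the index of the column with the maximum
  dt[, max := names(.SD)[max.col(.SD)], .SDcols = 1:3]
}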
use_datatable(dummy_df) %>% print()
##           x        y        z max
## 1: 3.117172 1.265315 5.118694   z
## 2: 1.868388 1.695199 7.484665   z
## 3: 8.735410 1.474048 1.217860   x
## 4: 3.808189 6.631653 5.958628   y
## 5: 6.075657 2.938117 6.275357   z
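Since max.col() does the real work here, it may help to see it on its own. The following is a minimal base R sketch on a small hand-made matrix (the values are assumptions chosen so the result can be checked by eye). Note that max.col() breaks ties at random by default, whereas which.max() always returns the first maximum; passing `ties.method = "first"` makes the two behave alike on ties.

```r
# Small hand-made matrix; row 2 contains a tie between y and z.
m <- matrix(c(3, 1, 5,
              2, 9, 9,
              8, 1, 1),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("x", "y", "z")))

# ties.method = "first" mirrors which.max(), which returns the first maximum.
# The default, "random", would pick y or z at random for row 2.
idx <- max.col(m, ties.method = "first")

# Map the column indices back to column names.
max_name <- colnames(m)[idx]
print(max_name)
```

With continuous random values, as in the dummy data above, exact ties are very unlikely, so the choice of `ties.method` mostly matters for integer or rounded data.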
Efficiency
Let's benchmark the three methods on a larger data frame.
library(microbenchmark) # for benchmarking
# 3e6 random values in 3 columns, i.e. one million rows
large_df <- matrix(runif(3e6), ncol = 3,
                   dimnames = list(NULL, names(dummy_df))) %>%
  as_tibble()
dim(large_df)
## [1] 1000000 3
microbenchmark(
use_rowwise(large_df),
use_apply(large_df),
use_datatable(large_df),
times = 30
)
## Unit: milliseconds
##                     expr         min          lq       mean      median
##    use_rowwise(large_df) 10642.03850 10828.97933 12185.4581 11591.86977
##      use_apply(large_df)  4482.81001  4802.02439  5825.6439  5412.52181
##  use_datatable(large_df)    43.81339    45.97957    82.9575    56.46214
##          uq        max neval
##  13270.8466 14965.8307    30
##   6582.7706  9428.1450    30
##     63.9674   332.6165    30
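Oh yeah, the data.table version is blazingly fast: its median run time is about 56 ms, compared with roughly 5.4 s for apply and 11.6 s for rowwise, which makes it about 100 and 200 times faster respectively. Much of that gain comes from max.col, which processes the whole table in a single vectorized call instead of invoking an R function once per row.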
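To get a feel for how much of the difference is due to max.col() itself rather than to data.table, the comparison can be repeated in base R alone. This is only a rough sketch under assumed settings (a smaller matrix of 10,000 rows, a different seed, and system.time() instead of microbenchmark), so the timings will not match the table above. It also checks that both approaches return the same answer.

```r
set.seed(1)

# Assumed test matrix: 10,000 rows, 3 named columns of uniform random values.
m <- matrix(runif(3e4), ncol = 3,
            dimnames = list(NULL, c("x", "y", "z")))

# Row-by-row: apply() calls which.max() once per row.
time_apply <- system.time(
  by_apply <- colnames(m)[apply(m, 1, which.max)]
)

# Vectorized: max.col() handles all rows in one call.
# ties.method = "first" matches the tie-breaking of which.max().
time_maxcol <- system.time(
  by_maxcol <- colnames(m)[max.col(m, ties.method = "first")]
)

# Both approaches should pick the same column for every row.
same_result <- identical(by_apply, by_maxcol)
print(same_result)
print(rbind(apply = time_apply, max.col = time_maxcol)[, "elapsed"])
```

If the vectorized call is already much faster here, that suggests the per-row function calls, not the data frame class, are the main cost in the slower methods.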