Speeding up the merge of many data frames in R

r performance dataframe merge

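I am currently using the code below to merge more than 130 data frames, and it takes far too long to run (in fact, I have never let it finish on the full dataset, only on subsets). Each table has two columns: unit (a string) and count (an integer). I merge by unit.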

# read each file; the count column is named after the file it came from
tables <- lapply(files, function(x) read.table(x, col.names = c("unit", x)))

MyMerge <- function(x, y){
  df <- merge(x, y, by="unit", all.x= TRUE, all.y= TRUE)
  return(df)
}

data <- Reduce(MyMerge, tables)

Here is a small comparison, first on a fairly small dataset and then on a larger one:
library(data.table)
library(plyr)
library(dplyr)
library(microbenchmark)

# sample size:
n <- 4e3

# create some data.frames:
df_list <- lapply(1:100, function(x) {
  out <- data.frame(id = 1:n,
                    type = sample(c("coffee", "americano", "espresso"), n, replace = TRUE))
  # give the second column a unique name (type1, type2, ...) so merges do not clash
  names(out)[2] <- paste0(names(out)[2], x)
  out
})

# transform dfs into data.tables:
dt_list <- lapply(df_list, function(x) {
  out <- as.data.table(x)
  setkey(out, "id")
  out
})

# wrap each method so that it performs a full outer join on "id":
mymerge <- function(...) base::merge(..., by = "id", all = TRUE)
mydplyr <- function(...) dplyr::full_join(..., by = "id")
myplyr <- function(...) plyr::join(..., by = "id", type = "full")
# merge() dispatches to the data.table method when given data.tables
mydt <- function(...) merge(..., by = "id", all = TRUE)

# Compare:
microbenchmark(base = Reduce(mymerge, df_list),
               dplyr= Reduce(mydplyr, df_list),
               plyr = Reduce(myplyr, df_list),
               dt = Reduce(mydt, dt_list), times=50)

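We can see that the two contenders are dplyr and data.table. Changing the sample size to 5e5 yields the following comparison, which shows that data.table indeed dominates. Note that I added this part following @BenBolker's suggestion.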
microbenchmark(dplyr= Reduce(mydplyr, df_list),
               dt = Reduce(mydt, dt_list), times=50)

Unit: seconds
expr      min       lq     mean   median       uq      max neval cld
dplyr 34.48993 34.85559 35.29132 35.11741 35.66051 36.66748    50   b
   dt 10.89544 11.32318 11.61326 11.54414 11.87338 12.77235    50  a 
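
Applied to the tables from the question, the data.table approach might look roughly like the sketch below. The three small tables and the column names f1, f2, f3 are made-up stand-ins for the files read in the question, and merge_by_unit is a hypothetical helper name; only the keyed outer merge via Reduce is taken from the benchmark above.

```r
library(data.table)

# Hypothetical stand-ins for the tables read from `files` in the question:
# each has a "unit" column and one count column named after its source.
tables <- list(
  data.frame(unit = c("u1", "u2"), f1 = c(1L, 2L), stringsAsFactors = FALSE),
  data.frame(unit = c("u2", "u3"), f2 = c(5L, 7L), stringsAsFactors = FALSE),
  data.frame(unit = c("u1", "u3"), f3 = c(4L, 9L), stringsAsFactors = FALSE)
)

# Convert each data.frame to a data.table keyed on the join column.
dt_tables <- lapply(tables, function(x) {
  out <- as.data.table(x)
  setkey(out, "unit")
  out
})

# Full outer join on "unit"; merge() dispatches to the data.table method.
merge_by_unit <- function(x, y) merge(x, y, by = "unit", all = TRUE)

merged <- Reduce(merge_by_unit, dt_tables)
print(merged)
```

Units missing from a given table end up as NA in that table's column, which matches the all.x = TRUE, all.y = TRUE behaviour of the original MyMerge.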

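Check out some data.table solutions (outer join). This might be faster.

Thanks @coffeinjunky. I tried the dplyr package based on the thread above, but unfortunately it was slower in my case.

@coffeinjunky For one of the smaller datasets, I was able to get from 77 seconds down to 66 seconds. Not amazing, but definitely useful :)

Do you need to keep track of which table each number came from? Or are you going to aggregate it at the end?

Maybe data.table dominates for larger sizes (the OP wants to merge frames with 5e5 rows rather than 4e3 rows…?)?

You're right @BenBolker, my sample size was too small. Updated my solution.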
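
One of the comments asks whether the counts need to stay separated by source table or will be aggregated at the end. If every table really has the same two columns, one possible alternative to repeated merging (not something the answer above benchmarks) is to stack all tables into one long table and then either sum per unit or reshape to wide. The sketch below assumes small made-up tables named a, b and c with columns unit and count; the column name source is also an illustrative choice.

```r
library(data.table)

# Hypothetical stand-ins for the per-file tables (columns: unit, count).
tables <- list(
  a = data.table(unit = c("u1", "u2"), count = c(1L, 2L)),
  b = data.table(unit = c("u2", "u3"), count = c(5L, 7L)),
  c = data.table(unit = c("u1", "u3"), count = c(4L, 9L))
)

# Stack everything into one long table; "source" records the originating table.
long <- rbindlist(tables, idcol = "source")

# If only totals are needed: aggregate once, no merging at all.
totals <- long[, .(count = sum(count)), by = unit]

# If one column per source is needed: reshape long to wide.
# Units absent from a source are filled with NA, as in an outer join.
wide <- dcast(long, unit ~ source, value.var = "count")

print(totals)
print(wide)
```

Whether this is faster than the keyed merges for 130 large tables would have to be measured on the real data.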