
Best practices for inner joins on large datasets in R

r parallel-processing


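I am trying to merge two large datasets (about 3.5 million rows each) using dplyr::inner_join. I am working on a powerful machine with 40+ cores. I am not sure I am taking advantage of the machine, because I am not parallelizing the task in any way. How should I approach this problem, which currently takes a very long time to run?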

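You should try the data.table package, which is much more efficient than dplyr for large datasets. I have copied the inner join code from there:

library(data.table)
DT

I don't think an inner join on 3.5M rows should have performance problems, unless duplicated values in the join (key) columns make the joined result grow toward 3.5M * 3.5M rows.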
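To make the point about duplicated keys concrete, here is a minimal sketch that estimates how many rows an inner join will produce before running it, and then performs the join with data.table. The tables dt1 and dt2 and their contents are made-up toy data, and the column names id_1, id_2 and value_1 are assumptions chosen for illustration.

```r
library(data.table)

# Toy stand-ins for the two tables (hypothetical data, not from the question)
dt1 <- data.table(id_1 = c(1, 1, 2, 3), value_1 = c(10, 20, 30, 40))
dt2 <- data.table(id_1 = c(1, 1, 2, 4), id_2 = c("a", "b", "c", "d"))

# Count rows per key value on each side
n1 <- dt1[, .(n_left = .N), by = id_1]
n2 <- dt2[, .(n_right = .N), by = id_1]

# An inner join yields n_left * n_right rows for every key present on both sides
key_counts <- merge(n1, n2, by = "id_1")
expected_rows <- key_counts[, sum(as.numeric(n_left) * n_right)]

# The inner join itself; allow.cartesian = TRUE permits many-to-many matches
joined <- merge(dt1, dt2, by = "id_1", allow.cartesian = TRUE)

print(expected_rows)
print(nrow(joined))
```

If the estimated row count is far larger than either input, the slowness probably comes from the size of the result rather than from the join implementation, and it may be worth deduplicating or aggregating one side first.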

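In general, functions in R do not use multiple cores. To take advantage of them, you have to split the data into batches that can be processed independently, then combine the per-batch results and continue the calculation from there. Below is pseudocode using the dplyr and doParallel libraries.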
library(dplyr)
library(doParallel)

# Parallel configuration #####
cpuCount <- 10
# Note that doParallel replicates your environment to each worker process,
# so if your environment uses 10GB of memory and you use 10 cores,
# it would require about 10GB x 10 = 100GB of RAM to process the data in parallel
registerDoParallel(cpuCount)

# data_1: 3.5M rows, with key column id_1 and value column value_1
# data_2: 3.5M rows, with key columns id_1 and id_2

# Goal is to calculate some stats/summary of value_1 for each combination of id_1 + id_2
id_1_unique <- unique(data_1$id_1)
batchStep <- 1000
# Start index of each batch in id_1_unique (the last element lies past the end)
batch_id_1 <- seq(1, length(id_1_unique) + batchStep, by = batchStep)

# Do the join and the summary/calculation for each batch of id_1, then return final_data
# foreach returns a list; for this pseudocode it is a list of datasets,
# which can be combined using bind_rows
summaryData <- bind_rows(foreach(index = 1:(length(batch_id_1) - 1), .packages = "dplyr") %dopar% {
    # Parentheses matter: index:index+batchStep-1 parses as (index:index) + batchStep - 1
    batch_start <- batch_id_1[index]
    batch_end <- min(batch_start + batchStep - 1, length(id_1_unique))
    batch_id_1_current <- id_1_unique[batch_start:batch_end]
    batch_data_1 <- data_1 %>% filter(id_1 %in% batch_id_1_current)
    joined_data <- inner_join(batch_data_1, data_2, by = "id_1")
    final_data <- joined_data %>%
        group_by(id_1, id_2) %>%
        # calculation code here
        summarise(calculated_value_1 = sum(value_1)) %>%
        ungroup()
    final_data
})
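Because every worker costs memory, it can help to create the worker pool explicitly and shut it down when the job is done. The following is a minimal sketch of that pattern; the cap of 4 workers is an arbitrary assumption for illustration and should be tuned to the available RAM.

```r
library(doParallel)  # also attaches foreach and parallel

# Pick a worker count; detectCores() can return NA on some platforms
cores <- parallel::detectCores()
if (is.na(cores)) cores <- 2
n_workers <- max(1, min(4, cores - 1))  # cap of 4 is an assumed value

# Create the cluster explicitly so it can be stopped later
cl <- parallel::makeCluster(n_workers)
registerDoParallel(cl)

# Trivial job to confirm that the parallel backend is working
squares <- foreach(i = 1:4, .combine = c) %dopar% i^2
print(squares)

# Release the worker processes and the memory they hold
parallel::stopCluster(cl)
```

Starting with a small worker count and watching memory use before scaling up is a safer way to find out how many of the 40 cores the join can actually use.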

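I ran into the same problem; you have to use data.table.

Here are some general tips on merging large datasets.

Hi @Sinh, how do you obtain the batch data? Could you edit your code to show that? Thanks.

Sorry, my bad: there is no batch_data_2. The inner_join with data_2 automatically restricts the result to rows that match batch_data_1. If the two datasets are too large to load at the same time, then depending on your memory limits you may want to split them up, save the pieces on disk, and process each batch file separately.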
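The disk-based approach mentioned in the last comment could look roughly like the sketch below. It uses toy data, a temporary directory, and a hypothetical helper batch_of() that assigns each id_1 to a batch; none of these come from the original answer. Both tables are split with the same rule, so matching keys always end up in the same batch.

```r
library(dplyr)

# Toy stand-ins for the two tables (hypothetical data)
data_1 <- tibble(id_1 = c(1, 1, 2, 3, 4), value_1 = c(10, 20, 30, 40, 50))
data_2 <- tibble(id_1 = c(1, 2, 2, 4), id_2 = c("a", "b", "c", "d"))

n_batches <- 2
batch_of <- function(id) id %% n_batches  # hypothetical batching rule
batch_file <- function(dir, name, b) file.path(dir, sprintf("%s_%d.rds", name, b))

out_dir <- tempfile("batches_")
dir.create(out_dir)

# Step 1: write each batch of both tables to disk
for (b in 0:(n_batches - 1)) {
  saveRDS(filter(data_1, batch_of(id_1) == b), batch_file(out_dir, "data_1", b))
  saveRDS(filter(data_2, batch_of(id_1) == b), batch_file(out_dir, "data_2", b))
}

# Step 2: load, join and summarise one batch at a time
results <- lapply(0:(n_batches - 1), function(b) {
  d1 <- readRDS(batch_file(out_dir, "data_1", b))
  d2 <- readRDS(batch_file(out_dir, "data_2", b))
  inner_join(d1, d2, by = "id_1") %>%
    group_by(id_1, id_2) %>%
    summarise(calculated_value_1 = sum(value_1), .groups = "drop")
})

# Step 3: combine the per-batch summaries
summary_data <- bind_rows(results)
print(summary_data)
```

With real data, step 1 would read the source in chunks instead of filtering an in-memory table, and the lapply in step 2 could be swapped for a foreach loop if memory allows several batches at once.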