How can I compare the elements of two large data sets in R as efficiently as possible?


r dataframe for-loop


I'm a hobbyist and a slow learner. Here is my situation:

I have two data frames, each with a few (4) columns and more than 10,000 rows, like this:

df1:                   df2:
Nº x   y   attr        Nº x   y   attr
1  45  34  X           1  34  23  x
1  48  45  XX          4  123 45  x
1  41  23  X           4  99  69  xx
4  23  12  X           4  112 80  xx
4  28  16  X           5  78  80  x
5  78  80  XXX         5  69  74  xx
...

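I want to compare the two data frames by x, y (coordinates) and delete from df1 every row whose coordinates also appear in df2 (that is, any coordinates present in both data sets are removed from df1).

So in my example, the last row of df1 would be deleted, because df2 contains the same coordinates.

What I have done so far is use a double for() loop, one over each data set, comparing every possible pair of values one by one. I know this is very inefficient, and it also takes a long time as the amount of data grows.

What other ways are there to do this? There are probably functions for it, but I usually don't know how to use them well, and that causes me problems.

Thank you very much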

Not the most elegant solution, but it gets the job done:

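library(data.table)

# Sample data from the question
df2 = fread('Nº x   y   attr
1  34  23  x
4  123 45  x
4  99  69  xx
4  112 80  xx
5  78  80  x
5  69  74  xx')

df1 = fread('Nº x   y   attr
1  45  34  X
1  48  45  XX
1  41  23  X
4  23  12  X
4  28  16  X
5  78  80  XXX')

# Keep the rows of df1 whose (x, y) pair does not occur in df2
df1[!stringr::str_c(df1$x, df1$y, sep="_") %in% stringr::str_c(df2$x, df2$y, sep="_"),]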

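Explanation:

It is better to use functions than loops. The expression !stringr::str_c(df1$x, df1$y, sep="_") %in% stringr::str_c(df2$x, df2$y, sep="_") concatenates the x and y columns into strings, then finds the elements of df1 that are not in df2. This creates a logical vector of TRUE/FALSE values, which we can then use to subset df1.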
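To make the explanation above concrete, here is a minimal sketch that splits the one-liner into its intermediate steps. The coordinates are the six sample rows from the question; the names key1, key2, keep and result are made up for illustration and do not come from the answer.

```r
library(stringr)

# Coordinates from the question's sample rows (the other columns are omitted).
df1 <- data.frame(x = c(45L, 48L, 41L, 23L, 28L, 78L),
                  y = c(34L, 45L, 23L, 12L, 16L, 80L))
df2 <- data.frame(x = c(34L, 123L, 99L, 112L, 78L, 69L),
                  y = c(23L, 45L, 69L, 80L, 80L, 74L))

# Step 1: build one string key per row, e.g. "45_34".
key1 <- str_c(df1$x, df1$y, sep = "_")
key2 <- str_c(df2$x, df2$y, sep = "_")

# Step 2: logical vector, TRUE where a df1 key does not occur in df2.
keep <- !(key1 %in% key2)

# Step 3: use the logical vector to subset the rows of df1.
result <- df1[keep, ]
```

Only the last row of df1 (78, 80) has a key that also occurs in df2, so keep is FALSE only in that position.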

Edit:

I was curious whether my answer or @dww's answer is faster:

> library(microbenchmark)
> 
> n=100000
> 
> df1 = data.table(x = sample(n), y=sample(n))
> df2 = data.table(x = sample(n), y=sample(n))
> 
> 
> 
> microbenchmark(
... df1[!stringr::str_c(df1$x, df1$y, sep="_") %in% stringr::str_c(df2$x, df2$y, sep="_"),],
... df1[fsetdiff(df1[, .(x,y)] , df2[, .(x,y)] ), on=c('x','y')]
... )
Unit: milliseconds
                                                                                              expr
 df1[!stringr::str_c(df1$x, df1$y, sep = "_") %in% stringr::str_c(df2$x,      df2$y, sep = "_"), ]
                                   df1[fsetdiff(df1[, .(x, y)], df2[, .(x, y)]), on = c("x", "y")]
       min        lq      mean    median        uq      max neval
 168.40953 199.37183 219.30054 209.61414 222.08134 364.3458   100
  41.07557  42.67679  52.34855  44.34379  59.27378 152.1283   100

It looks like dww's data.table version is roughly 5 times faster.


library(data.table)

Method:

df1[fsetdiff(df1[, .(x,y)] , df2[, .(x,y)] ), on=c('x','y')]
#   Nº  x  y attr
#1:  1 45 34    X
#2:  1 48 45   XX
#3:  1 41 23    X
#4:  4 23 12    X
#5:  4 28 16    X
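For comparison, data.table also supports an anti-join written as X[!Y, on = ...], which expresses the same filter in one step. This is a sketch of an alternative, not part of the answer above, and it was not included in the benchmark; the column name id stands in for the question's Nº column and is an assumption made to keep the example ASCII-only.

```r
library(data.table)

# Sample rows from the question; `id` is a stand-in name for the Nº column.
df1 <- data.table(
  id   = c(1L, 1L, 1L, 4L, 4L, 5L),
  x    = c(45L, 48L, 41L, 23L, 28L, 78L),
  y    = c(34L, 45L, 23L, 12L, 16L, 80L),
  attr = c("X", "XX", "X", "X", "X", "XXX")
)
df2 <- data.table(
  id   = c(1L, 4L, 4L, 4L, 5L, 5L),
  x    = c(34L, 123L, 99L, 112L, 78L, 69L),
  y    = c(23L, 45L, 69L, 80L, 80L, 74L),
  attr = c("x", "x", "xx", "xx", "x", "xx")
)

# Anti-join: keep the rows of df1 that have no (x, y) match in df2.
result <- df1[!df2, on = c("x", "y")]
```

On the sample data this drops only the (78, 80) row, matching the output shown above.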


Three lines of code:

# generate sample data
x1 <- sample(1:50, 9001, replace = TRUE)
y1 <- sample(1:50, 9001, replace = TRUE)

x2 <- sample(1:50, 9001, replace = TRUE)
y2 <- sample(1:50, 9001, replace = TRUE)

df1 <- data.frame(id = 1:9001, x1, y1, stringsAsFactors = FALSE)
df2 <- data.frame(id = 1:9001, x2, y2, stringsAsFactors = FALSE)

# add a match column to each data frame
df1$match <- paste(df1$x1, df1$y1)
df2$match <- paste(df2$x2, df2$y2)

# overwrite df1 with the rows of df1 whose match key does not appear in df2
df1 <- df1[!df1$match %in% df2$match, ]
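The approach above depends on the pasted match key being unique per coordinate pair. paste() separates its arguments with a single space by default, which keeps integer coordinates apart. The sketch below, using two made-up points, shows what could go wrong if the separator were empty; the names pts, no_sep and with_sep are illustrative only.

```r
# Two distinct points: (1, 23) and (12, 3).
pts <- data.frame(x = c(1L, 12L), y = c(23L, 3L))

# With an empty separator both points collapse to the same key.
no_sep <- paste(pts$x, pts$y, sep = "")

# With a separator the keys stay distinct.
with_sep <- paste(pts$x, pts$y, sep = "_")

print(no_sep)    # "123" "123"
print(with_sep)  # "1_23" "12_3"
```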
