What's the fastest way to subset a data.table? (R, data.table)


It seems to me that the fastest way to take a row/column subset of a data.table is to use a join with the nomatch option. Is this correct?
DT = data.table(rep(1:100, 100000), rep(1:10, 1000000))
setkey(DT, V1, V2)
system.time(DT[J(22,2), nomatch=0L])
# user  system elapsed 
# 0.00    0.00    0.01 
system.time(subset(DT, (V1==22) & (V2==2)))
# user  system elapsed 
# 0.45    0.21    0.67 

identical(DT[J(22,2), nomatch=0L],subset(DT, (V1==22) & (V2==2)))
# [1] TRUE

The fast join based on binary search also has a problem: I cannot find a way to select all items along one dimension.
Suppose I then want to do:
DT[J(22,2), nomatch=0]  # subset on TWO dimensions
DT[J(22,), nomatch=0]   # subset on ONE dimension only
# Error in list(22, ) : argument 2 is empty

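without resetting the key to a single dimension (I am inside a loop and do not want to reset the key on every iteration).

What is the fastest way to subset a data.table?
Subsetting with a binary search on the key is the fastest. Note that the subset needs the option nomatch=0L so that only matching rows are returned.

How do I subset by only one of the keys when two key columns are set?
If you have two key columns set on DT and want to subset by the first one, you only need to supply the first value in J(.); nothing has to be supplied for the second key column. That is: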
# returns all rows where the first key column matches 22
DT[J(22), nomatch=0L]

If instead you want to subset by the second key column, then for now you have to supply all the unique values of the first key column. That is:
# returns all rows where the 2nd key column matches 2
DT[J(unique(V1), 2), nomatch=0L]

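This has been shown elsewhere as well. I would prefer DT[J(, 2)] to work for this case, since that looks quite intuitive.
There is also a pending feature request for secondary keys, which would resolve this once implemented.
Here is a better example: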
DT = data.table(c(1,2,3,4,5), c(2,3,2,3,2))
DT
#    V1 V2
# 1:  1  2
# 2:  2  3
# 3:  3  2
# 4:  4  3
# 5:  5  2
setkey(DT,V1,V2)
DT[J(unique(V1),2)]
#    V1 V2
# 1:  1  2
# 2:  2  2
# 3:  3  2
# 4:  4  2
# 5:  5  2
DT[J(unique(V1),2), nomatch=0L]
#    V1 V2
# 1:  1  2
# 2:  3  2
# 3:  5  2
DT[J(3), nomatch=0L]
#    V1 V2
# 1:  3  2
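
As a hedged aside on the second-key case: data.table 1.9.6 and later provide the `on` argument, which allows an ad hoc join on any column without changing the key. The sketch below assumes such a version is installed; the names `res_on` and `res_key` are only illustrative.

```r
library(data.table)

DT <- data.table(V1 = c(1, 2, 3, 4, 5), V2 = c(2, 3, 2, 3, 2))
setkey(DT, V1, V2)

# Ad hoc join on V2 only; the key on (V1, V2) is left untouched.
# Assumes data.table >= 1.9.6, where the `on` argument is available.
res_on <- DT[.(2), on = "V2", nomatch = 0L]

# Same rows via the keyed workaround: supply every unique V1 value.
res_key <- DT[J(unique(V1), 2), nomatch = 0L]

print(res_on)
print(res_key)
```

Both calls should select the rows where V2 equals 2; the `on` form avoids enumerating the values of the first key column.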

In summary:
# key(DT) = c("V1", "V2")

# data.frame                        |             data.table equivalent
# =====================================================================
# subset(DF, (V1 == 3) & (V2 == 2)) |            DT[J(3,2), nomatch=0L]
# subset(DF, (V1 == 3))             |              DT[J(3), nomatch=0L]
# subset(DF, (V2 == 2))             |  DT[J(unique(V1), 2), nomatch=0L]

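There is good documentation on the timings of common data.table operations, including subsetting.

I looked at the extract in paragraph 1.1 of that document. It seems to me that an extract is not quite a subset unless the nomatch option is added. Also, it does not address the 1-dimensional versus 2-dimensional subset (see my related update).

@tucson These are two completely different questions that need different answers, so they could have had their own posts, but never mind. A solution to the second question is available separately.

An important caveat to keep in mind: for a single subset, DT[V1==3 & V2==2] will be faster than setting the key and then doing a binary search. Binary search therefore only pays off if the key is already set or if you run many searches per setkey.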
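
The caveat above can be checked with a rough timing sketch. The table size `n` and the names `DT2`, `t_scan` and `t_key` are assumptions made for illustration. Timings depend on the machine and on the data.table version (newer versions may optimise `==` expressions internally), so no expected numbers are given.

```r
library(data.table)

# Smaller than the table in the question, to keep the sketch quick to run.
n <- 1e6L
DT <- data.table(V1 = rep(1:100, length.out = n),
                 V2 = rep(1:10, length.out = n))

# One-off subset with a logical expression in i; no key is required.
t_scan <- system.time(res_scan <- DT[V1 == 22L & V2 == 2L])

# One-off subset via binary search, with the cost of setkey() included.
DT2 <- copy(DT)
t_key <- system.time({
  setkey(DT2, V1, V2)
  res_key <- DT2[J(22L, 2L), nomatch = 0L]
})

print(t_scan)
print(t_key)
```

If the same keyed table is queried many times, only the first query pays for `setkey()`, which is the situation where binary search is expected to come out ahead.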