R: How to search quickly in a large data.table (57M obs)?

How can I search quickly inside a data.table using sqldf?

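I need a function that returns a value from one data.table column based on the values of two other columns:

require(data.table)
dt
                         base prediction  count
1:                         of        the 258586
2:                         of        set 246646
3:                         of     course 137533
4: lead and background vocals       from      4
5:          save thou me from        the      4
6:         silent in the face         of      4

The function should return the prediction that has the largest count for the given base value. For example, for the input "of" the desired output is "the", and for "save thou me from" it is also "the".

The solution provided here works for small data sets but not for a very large data.table with 57M obs: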

f1 <- function(val) dt[base == val, prediction[which.max(count)]]

I have read that indexing the data.table and searching with sqldf functions can speed this up, but I don't yet know how to do that.

Thanks for your help.

Using sqldf it would look like this. If the data does not fit into memory, add the dbname = tempfile() argument so that an on-disk database is used.

library(sqldf)

val <- "of"
fn$sqldf("select max(count) count, prediction from dt where base = '$val'")
##   count prediction
##1 258586        the
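As a minimal sketch of the dbname = tempfile() variant mentioned above, the call below passes that argument so sqldf uses a temporary on-disk SQLite database instead of an in-memory one. The sample dt is the six-row table from the question, and the name res is only illustrative.

```r
library(sqldf)       # also loads gsubfn, which provides the fn$ prefix
library(data.table)

# Sample data from the question.
dt <- data.table(
  base = c("of", "of", "of", "lead and background vocals",
           "save thou me from", "silent in the face"),
  prediction = c("the", "set", "course", "from", "the", "of"),
  count = c(258586, 246646, 137533, 4, 4, 4)
)

val <- "of"

# dbname = tempfile() makes sqldf upload dt to a temporary database file
# rather than holding the SQLite copy in memory.
res <- fn$sqldf(
  "select max(count) count, prediction from dt where base = '$val'",
  dbname = tempfile()
)
res
```

Note that returning prediction from the same row as max(count) relies on SQLite's handling of bare columns in a max() aggregate query; other SQL back ends may not behave this way.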

Alternatively, to set up the database and create an index directly with RSQLite:

library(gsubfn)
library(RSQLite)

con <- dbConnect(SQLite(), "dt.db")
dbWriteTable(con, "dt", dt)
dbExecute(con, "create index idx on dt(base)")

val <- "of"
fn$dbGetQuery(con, "select max(count) count, prediction from dt where base = '$val'")
##    count prediction
## 1 258586        the

dbDisconnect(con)
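If the lookup value comes from user input, string interpolation with '$val' can break on values containing quotes. A hedged variation, not part of the original answer, is to bind the value as a query parameter through DBI. The sketch below assumes a temporary database file instead of dt.db, a composite index on (base, count), and a hypothetical helper named lookup_prediction.

```r
library(DBI)
library(RSQLite)
library(data.table)

# Sample data from the question.
dt <- data.table(
  base = c("of", "of", "of", "lead and background vocals",
           "save thou me from", "silent in the face"),
  prediction = c("the", "set", "course", "from", "the", "of"),
  count = c(258586, 246646, 137533, 4, 4, 4)
)

# Assumption: a throwaway database file rather than "dt.db".
con <- dbConnect(SQLite(), tempfile(fileext = ".db"))
dbWriteTable(con, "dt", dt)

# Assumption: indexing both columns so the ordered lookup can use the index.
dbExecute(con, 'create index idx on dt(base, "count")')

# Hypothetical helper: the value is bound to the ? placeholder,
# so no quoting or escaping of val is needed.
lookup_prediction <- function(con, val) {
  dbGetQuery(
    con,
    'select prediction from dt where base = ? order by "count" desc limit 1',
    params = list(val)
  )$prediction
}

best_of <- lookup_prediction(con, "of")
best_silent <- lookup_prediction(con, "silent in the face")

dbDisconnect(con)
```

The order by ... limit 1 form is used here because it does not depend on SQLite-specific aggregate behavior.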

Note: Run this first:

library(data.table)

dt <- data.table(
    "base" = c("of", "of", "of", "lead and background vocals", 
     "save thou me from", "silent in the face"),
    "prediction" = c("the", "set", "course", "from", "the", "of"),
    "count" = c(258586, 246646, 137533, 4, 4, 4)
)

You could consider using only data.table, as follows. I think it can improve speed significantly.

dt <- data.table(
"base" = c("of", "of", "of", "lead and background vocals", "save thou me from", 
"silent in the face"),
"prediction" = c("the", "set", "course", "from", "the", "of"),
"count" = c(258586, 246646, 137533, 4, 4, 4)
)

# set the key on both base and count.
# This rearranges the data such that the max value of count for each group in base 
# corresponds to the last row.
setDT(dt, key = c("base", "count"))

# for a given group in base, we consider only the last value of prediction as it is 
# on the same row with the max value of count. 
prediction <- function(x) {
  dt[.(x), prediction[.N] ]
}

prediction("of")
#"the"
prediction("save thou me from")
#"the"
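As a quick sanity check of the keyed approach, the sketch below compares it with the original which.max function from the question on every base value in the sample data. The names f1_result, keyed_result and same are illustrative only.

```r
library(data.table)

# Sample data from the question.
dt <- data.table(
  base = c("of", "of", "of", "lead and background vocals",
           "save thou me from", "silent in the face"),
  prediction = c("the", "set", "course", "from", "the", "of"),
  count = c(258586, 246646, 137533, 4, 4, 4)
)

# Original approach from the question: scan for base, then take which.max.
f1 <- function(val) dt[base == val, prediction[which.max(count)]]

# Keyed approach: sort by base, then count, so the max count is last per base.
setkey(dt, base, count)
prediction <- function(x) dt[.(x), prediction[.N]]

bases <- unique(dt$base)
f1_result <- vapply(bases, f1, character(1))
keyed_result <- vapply(bases, prediction, character(1))

# TRUE when both approaches agree on every base value.
same <- identical(f1_result, keyed_result)
same
```

The two approaches may differ when several rows of one base share the maximum count, since which.max takes the first such row and the keyed lookup takes the last.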

Comments:

Can't you just use the sqldf package if you want to use SQL? Try running setkey(dt, base) and then use dt[val, ...] instead of dt[base == val, ...] and see if it is faster. If you run this repeatedly, you may want to set the key once and keep the keyed dt.

Christian Kamgang's answer is very good. If possible, I would suggest first collecting all the values you will later query into a smaller set using a single subset on multiple values, and then subsetting single elements from that smaller set.

Hi Grothendieck, I got an error: > con

Start a new session of R and make sure that the current directory is writable and that dt.db does not exist. Then copy and paste the code in the Note into R, and after that copy and paste the code in the body of the answer into R.

Why not use dt[.(x), mult = "last", prediction]? This gives the same result. I think it would be even faster, at least on my computer.

Yes, it should be faster, because the join result is materialized for the last match only. When subsetting with .N, the join first materializes all matches and only then subsets to the last one.
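The mult = "last" suggestion from the comments can be sketched as follows, assuming the table is keyed on base and count as in the data.table answer. The function name prediction_last and the variable names are illustrative. The second call shows one join over several base values at once, in the spirit of the suggestion to query multiple values in a single subset.

```r
library(data.table)

# Sample data from the question.
dt <- data.table(
  base = c("of", "of", "of", "lead and background vocals",
           "save thou me from", "silent in the face"),
  prediction = c("the", "set", "course", "from", "the", "of"),
  count = c(258586, 246646, 137533, 4, 4, 4)
)

# Key on base, then count: within each base the max count is the last row.
setkey(dt, base, count)

# Join on base only and keep just the last matching row per lookup value.
prediction_last <- function(x) dt[.(x), prediction, mult = "last"]

single <- prediction_last("of")

# Several lookup values in one join; results follow the order of the inputs.
several <- prediction_last(c("of", "silent in the face"))

single
several
```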