R 标识筛选行的原始行号_R_Data.table

R 标识筛选行的原始行号

R 标识筛选行的原始行号,r,data.table,R,Data.table,我想筛选行并将原始行号（源数据）存储在data.table列中我知道.I有一个语法变体（请参阅），但这需要我过滤两次我想要避免的内容： DT <- mtcars setDT(DT) row.numbers <- DT[, .I[ gear > 4]] # > row.numbers # [1] 27 28 29 30 31 DT[row.numbers, .(row.numbers, gear)] # row.numbers gear # 1:

我想筛选行并将原始行号（源数据）存储在

data.table

列中

我知道

.I

有一个语法变体（请参阅），但这需要我过滤两次我想要避免的内容：

DT <- mtcars
setDT(DT)
row.numbers <- DT[, .I[ gear > 4]]
# > row.numbers
# [1] 27 28 29 30 31
DT[row.numbers, .(row.numbers, gear)]
#    row.numbers gear
# 1:          27    5
# 2:          28    5
# 3:          29    5
# 4:          30    5
# 5:          31    5

有没有更简单/优化的解决方案？

您可以在筛选前添加一列行号：

library(data.table)
data.table(mtcars)[, rn := .I][gear > 4, .(rn, gear)]

标杆管理这只是一个带有

mtcars

数据集（32行）的快速基准测试，它非常小，但这里的重点是开销

microbenchmark::microbenchmark(
  copy = DT <- data.table(mtcars),
  ryoda = {
    DT <- data.table(mtcars)
    row.numbers <- DT[, .I[ gear > 4]]
    DT[row.numbers, .(row.numbers, gear)]
  },
  uwe = {
    DT <- data.table(mtcars)
    DT[, rn := .I][gear > 4, .(rn, gear)]
  },
  times = 1000L
)

请注意，每个基准测试运行都是从一个新的DT副本开始的，因为其中一个代码正在修改

DT

（使用

：=

）

在这里，似乎对微小的32行样本数据集进行链接会造成50到60微秒的损失。对于800m的大数据集，根据过滤行的数量（即

行的长度），行在两个方向上的差异约为1%。数字可以在过滤前添加一列行号：
library(data.table)
data.table(mtcars)[, rn := .I][gear > 4, .(rn, gear)]

标杆管理
这只是一个带有mtcars
数据集（32行）的快速基准测试，它非常小，但这里的重点是开销
microbenchmark::microbenchmark(
  copy = DT <- data.table(mtcars),
  ryoda = {
    DT <- data.table(mtcars)
    row.numbers <- DT[, .I[ gear > 4]]
    DT[row.numbers, .(row.numbers, gear)]
  },
  uwe = {
    DT <- data.table(mtcars)
    DT[, rn := .I][gear > 4, .(rn, gear)]
  },
  times = 1000L
)

请注意，每个基准测试运行都是从一个新的DT副本开始的，因为其中一个代码正在修改DT
（使用：=
）
在这里，似乎对微小的32行样本数据集进行链接会造成50到60微秒的损失。对于800m的大数据集，根据过滤行的数量，即行的长度，行在两个方向上的差异约为1%。数字编辑2:添加了@Frank的w2
变体
为了补充@UweBlock的公认答案，我做了一些基准测试，我想在这里展示这些基准测试，以分享结果：
library(data.table)
library(microbenchmark)
# size: about 800 MB
DT <- data.table(x = sample(1000, 1E8, replace = TRUE), y = sample(1000, 1E8, replace = TRUE))
LIMIT <- 500

microbenchmark(row.filter = {
  row.numbers <- DT[, .I[x > LIMIT]]
  res <- DT[row.numbers, .(row.numbers, x, y)]
},
chaining = {
  res <- DT[, row.number := .I][x > LIMIT, .(row.number, x, y)]
},
w2 = {
  w = DT[x > LIMIT, which = TRUE ]
  DT[w, c("x","y")][, w := w ]
},
times = 20)

编辑1:比较过滤器选择性的影响：
对于限制编辑2:添加了@Frank的w2
变体
为了补充@UweBlock的公认答案，我做了一些基准测试，我想在这里展示这些基准测试，以分享结果：
library(data.table)
library(microbenchmark)
# size: about 800 MB
DT <- data.table(x = sample(1000, 1E8, replace = TRUE), y = sample(1000, 1E8, replace = TRUE))
LIMIT <- 500

microbenchmark(row.filter = {
  row.numbers <- DT[, .I[x > LIMIT]]
  res <- DT[row.numbers, .(row.numbers, x, y)]
},
chaining = {
  res <- DT[, row.number := .I][x > LIMIT, .(row.number, x, y)]
},
w2 = {
  w = DT[x > LIMIT, which = TRUE ]
  DT[w, c("x","y")][, w := w ]
},
times = 20)

编辑1:比较过滤器选择性的影响：
对于@RYoda的答案中的示例，LIMIT稍微快一点：
w = DT[x > LIMIT, which = TRUE ]
DT[w, c("x","y")][, w := w ]

要更改结果中列的顺序，setcolorder
应该可以工作，几乎不需要花费时间。
对于@RYoda的答案中的示例，速度要快一点：
w = DT[x > LIMIT, which = TRUE ]
DT[w, c("x","y")][, w := w ]

要更改结果中列的顺序，setcolorder
应该可以工作，几乎不需要时间。为什么要使用另一个基准？
Frank在他的报告中提到，从（rn，gear）
切换到c（“rn”，“gear”）
可能会有一个加速，但没有单独进行基准测试
在中，示例数据的类型为integer，但限制为什么使用其他基准？
Frank在他的报告中提到，从（rn，gear）
切换到c（“rn”，“gear”）
可能会有一个加速，但没有单独进行基准测试
在中，示例数据的类型为integer，但可能是限制，但据我所知，data.table是i
参数中的子集，在j
参数中进行任何计算之前。可能是，但据我所知，data.table是i
参数中的子集，在j
参数中进行任何计算之前。是的，链接是一种选项，但也可能会降低性能成本（尽管它看起来很小，因为只追加了一个新的列/向量）。但是过滤只做了一次，很好的回答！哦，真有趣。我知道管道（%%>%%
）可能会影响性能，但我不知道链接也可能会影响性能（我相信数据表
的人非常注重效率）。你有这样的例子吗？幸运的是，我又找到了：（这与内存使用有关，但内存使用可能会影响性能）非常感谢你的链接。我想知道今天，2年半的时间，以及以后对数据表的大量优化，会如何回答这个问题。绝对正确，值得一提，因为@arun参与了相关讨论。我想已经有这么一个问题，答案或评论带我到这个链接，但我没有找到它。是的，链接是一个选项，但也可能会降低性能（即使它看起来很小，因为只有一个新的列/向量被附加）。但是过滤只做了一次，很好的回答！哦，真有趣。我知道管道（%%>%%
）可能会影响性能，但我不知道链接也可能会影响性能（我相信数据表
的人非常注重效率）。你有这样的例子吗？幸运的是，我又找到了：（这与内存使用有关，但内存使用可能会影响性能）非常感谢你的链接。我想知道今天，2年半的时间，以及以后对数据表的大量优化，会如何回答这个问题。绝对正确，值得一提，因为@arun参与了相关讨论。我想已经有这么一个问题，答案或评论带我到这个链接，但我没有发现它。很好，特别是检查过滤器的选择性。似乎对于较少的行（较短的row.number
vector，使用x>900
进行过滤），row.filter
方法大约快1.5%，而对于较长的row.number
vector，链接大约快0.5%。这是另一个基准，虽然我没有耐心运行它超过3倍：而且w2={w=DT[x>th，它=TRUE]；DT[w，c（“x”，“y”）][，w:=w]}
对我来说是最快的。（如果您想先获得行号，可以在结果上使用setcolorder。）我正在更新基准测试
Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval cld
 row.filter 900.9504 905.0694 914.9406 907.5211 916.2071  964.6856    20  b 
   chaining 927.1630 932.0981 965.8222 970.9336 981.5885 1030.6396    20   c
         w2 607.0091 609.8028 620.5582 612.0490 615.2337  669.9706    20 a  

w = DT[x > LIMIT, which = TRUE ]
DT[w, c("x","y")][, w := w ]

dcast(bm_med[limit == 500L & type == "int"][
  , expr := forcats::fct_reorder(factor(expr), -time)],
  expr ~ n_rows, fun.aggregate = function(x) max(x/1E6), value.var = "time")

           expr       100      1000     10000    1e+05    1e+06    1e+07    1e+08
1: chaining_nse 0.8189745 0.8493695 1.0115405 2.870750 22.34469 441.1621 2671.179
2:   row.filter 0.7693225 0.7972635 0.9622665 2.677807 21.30861 247.3984 2677.495
3:    which_nse 0.8486145 0.8690035 1.0117295 2.620980 18.39406 219.0794 2341.990
4:  chaining_se 0.5299360 0.5582545 0.6454755 1.700626 12.48982 166.0164 2049.904
5:     which_se 0.5894045 0.6114935 0.7040005 1.624166 13.00125 130.0718 1289.050

library(data.table)
library(microbenchmark)
run_bm <- function(n_rows, limit = 500L, type = "int") {
  set.seed(1234L)
  DT <- data.table(x = sample(1000, n_rows, replace = TRUE), 
                   y = sample(1000, n_rows, replace = TRUE))
  LIMIT <- switch(type,
                  int = as.integer(limit),
                  dbl = as.double(limit))
  times <- round(scales::squish(sqrt(1E8 / n_rows) , c(3L, 100L)))
  cat("Start run:", n_rows, limit, type, times, "\n")
  microbenchmark(row.filter = {
    row.numbers <- DT[, .I[x > LIMIT]]
    DT[row.numbers, .(row.numbers, x, y)]
  },
  chaining_nse = {
    DT[, row.number := .I][x > LIMIT, .(row.number, x, y)]
  },
  chaining_se = {
    DT[, row.number := .I][x > LIMIT, c("row.number", "x", "y")]
  },
  which_nse = {
    row.numbers <- DT[x > LIMIT, which = TRUE ]
    DT[row.numbers, .(x, y)][, row.numbers := row.numbers ][]
  },
  which_se = {
    row.numbers <- DT[x > LIMIT, which = TRUE ]
    DT[row.numbers, c("x", "y")][, row.numbers := row.numbers][]
  },
  times = times)
}
# parameter
bm_par <- CJ(n_rows = 10^seq(2L, 8L, 1L), 
             limit = seq(100L, 900L, 400L), 
             type = c("int", "dbl"))
# run the benchmarks
bm_raw <- bm_par[, run_bm(n_rows, limit, type), by = .(n_rows, limit, type)]
# aggregate results
bm_med <- bm_raw[, .(time = median(time)), by = .(n_rows, limit, type, expr)]

library(ggplot2)

# chart 1
ggplot(
  dcast(bm_med, n_rows + limit + expr ~ type, value.var = "time")[
    , ratio := dbl / int - 1.0] #[limit == 500L]
) + 
  aes(n_rows, ratio, colour = expr) +
  geom_point() + 
  geom_line() + 
  facet_grid(limit ~ expr) + 
  scale_x_log10(labels = function(x) scales::math_format()(log10(x))) +
  scale_y_continuous(labels = scales::percent) + 
  coord_cartesian(ylim = c(-0.1, 0.5)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  ggtitle("Performance loss due to type conversion") +
  ylab("Relative computing time dbl vs int") + 
  xlab("Number of rows (log scale)")
ggsave("p2.png")

# chart 2
ggplot(
  dcast(bm_med[, c("code", "eval") := tstrsplit(expr, "_")][!is.na(eval)], 
        n_rows + limit + type + code ~ eval, value.var = "time")[
          , ratio := nse / se - 1.0][type == "int"]
) + 
  aes(n_rows, ratio, colour = code) +
  geom_point() + 
  geom_line() + 
  facet_grid(limit  + type ~ code) + 
  scale_x_log10(labels = function(x) scales::math_format()(log10(x))) +
  scale_y_continuous(labels = scales::percent) + 
  geom_hline(yintercept = 0) +
  theme_bw() +
  ggtitle("Performance loss due to non standard evaluation") +
  ylab("Relative computing time NSE vs SE") + 
  xlab("Number of rows (log scale)")
ggsave("p3.png")

# chart 3
ggplot(bm_med[limit == 500L][type == "int"]) + 
  aes(n_rows, time/1E6, colour = expr) +
  geom_point() + 
  geom_smooth(se = FALSE) + 
  facet_grid(limit ~ type) +
  facet_grid(type ~ limit) +
  scale_x_log10(labels = function(x) scales::math_format()(log10(x))) +
  scale_y_log10(labels = function(x) scales::math_format()(log10(x))) +
  theme_bw() +
  ggtitle("Benchmark results (log-log scale)") +
  ylab("Computing time in ms (log scale)") + 
  xlab("Number of rows (log scale)")
ggsave("p1.png")