R: assigning IDs to records without using a loop
I need to iterate over data table A and, based on a condition, assign an incremental ID to each record or group of records, like this:
library(data.table)
A <- data.table(x = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14),
                y = c(2,2,2,2,2,2,2,2,3,3,3,3,3,3),
                z = 0)
# Row by row: where x is a multiple of y, store the row number in z
for (i in seq_len(nrow(A))) {
  if ((A[i]$x %% A[i]$y) == 0) { A[i]$z <- i }
  print(i)
}
The z column becomes a kind of rolling ID.
I need to do the same thing without using a loop.

You can get the indices where the %% operator returns 0 and assign the index value at those positions:
inds <- A$x %% A$y == 0
A$z[inds] <- which(inds)
A
#      x y  z
#  1:  1 2  0
#  2:  2 2  2
#  3:  3 2  0
#  4:  4 2  4
#  5:  5 2  0
#  6:  6 2  6
#  7:  7 2  0
#  8:  8 2  8
#  9:  9 3  9
# 10: 10 3  0
# 11: 11 3  0
# 12: 12 3 12
# 13: 13 3  0
# 14: 14 3  0
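As a sanity check, the loop from the question and the vectorized assignment can be compared on the same data. The sketch below uses plain vectors instead of a data.table so that it runs in base R; the helper names `loop_ids` and `vector_ids` are made up for illustration and do not appear in the original posts.

```r
# Example data from the question, as plain vectors
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
y <- c(2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)

# Hypothetical helper: the row-by-row loop from the question
loop_ids <- function(x, y) {
  z <- numeric(length(x))
  for (i in seq_along(x)) {
    if (x[i] %% y[i] == 0) z[i] <- i
  }
  z
}

# Hypothetical helper: the vectorized version
vector_ids <- function(x, y) {
  z <- numeric(length(x))
  inds <- x %% y == 0        # logical mask of rows meeting the condition
  z[inds] <- which(inds)     # positions of the TRUE entries
  z
}

# Compare the two results
identical(loop_ids(x, y), vector_ids(x, y))
```

Both helpers start from a vector of zeros, so rows that fail the condition keep 0, matching the `z = 0` default in the question.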
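Alternatively, you can try this, since x already contains the index values. On rows where the condition x %% y == 0 holds, z is updated by reference with the value of x. On all other rows, z keeps its original value, i.e. 0.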
A[ x %% y == 0, z:=x]
#      x y  z
#  1:  1 2  0
#  2:  2 2  2
#  3:  3 2  0
#  4:  4 2  4
#  5:  5 2  0
#  6:  6 2  6
#  7:  7 2  0
#  8:  8 2  8
#  9:  9 3  9
# 10: 10 3  0
# 11: 11 3  0
# 12: 12 3 12
# 13: 13 3  0
# 14: 14 3  0
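Of course, you can also use .I to get the row indices:
A[ x %% y == 0, z := .I]
will also work... Depending on your column classes, you may have to convert some integer columns to class double to avoid warning messages.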
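When x does not happen to equal the row number, z := x no longer gives a row-based ID. It is also worth checking what .I contains once a filter is supplied in i for your data.table version. A minimal sketch that sidesteps both points computes the row number over the whole table and masks it with the condition. The table `B` is hypothetical, and `fifelse()` is assumed to be available (it ships with reasonably recent data.table releases; base `ifelse()` should work the same way here).

```r
library(data.table)

# Hypothetical data: x does not match the row number here
B <- data.table(x = c(10, 11, 12, 13), y = c(2, 2, 3, 3), z = 0)

# No filter in i, so .I is the row number within the whole table;
# rows that fail the condition get 0. as.numeric() keeps z of class double.
B[, z := fifelse(x %% y == 0, as.numeric(.I), 0)]
B$z
```

Rows 1 and 3 meet the condition, so they receive their own row numbers and the remaining rows stay at 0.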
Benchmarks
Up to about 50,000 rows, Ronak's answer is faster; beyond that, the .I solution wins.
Code used for benchmarking
# Table sizes to benchmark, from 1 row up to roughly 10 million rows
vec <- c( seq( 1, 10000, by = 1000), seq( 1, 100000, by = 10000),
          seq( 1, 1000000, by = 100000), seq( 1, 10000000, by = 1000000) )
l <- lapply( vec, function(x){
  # All columns are double to avoid type-coercion warnings
  A <- data.table(x = as.double( 1:x ),
                  y = as.double( sample(2:3, x, replace = TRUE) ),
                  z = as.double(0) )
  m <- microbenchmark::microbenchmark(
    Ronak = {
      DT <- copy(A)
      inds <- DT$x %% DT$y == 0
      DT$z[inds] <- which(inds)
    },
    Wimpel = {
      DT <- copy(A)
      DT[ x %% y == 0, z:=as.double(.I)]
    },
    times = 10 )
  # Median time (in nanoseconds) per expression, tagged with the row count
  setDT(m)[, .(n = x, median = median(time)), by = .(expr)][]
})
library(scales)
library(ggplot2)
# Plot median time in milliseconds against the number of rows
ggplot( data = rbindlist(l), aes( x = n, y = median/1000000, group = expr, colour = expr )) +
  geom_smooth( se = FALSE ) +
  labs( x = "rows",
        y = "median [ms]" )
Here is something slightly faster for larger row counts, and very similar for smaller row counts.
A[['z']]