R data.table: applying a function to a group while excluding the current row


Suppose I have:

library(data.table)
x = data.table( id=c(1,1,1,2,2,2), price=c(100,110,120,200,200,220) )
> x
   id price
1:  1   100
2:  1   110
3:  1   120
4:  2   200
5:  2   200
6:  2   220

and want to find, for each row, the cheapest price in its group (by=id) after ignoring the current row. So the result should look like this:

> x
   id price   cheapest_in_this_id_omitting_current_row
1:  1   100   110       # if I take this row out the cheapest is the next row
2:  1   110   100       # row 1
3:  1   120   100       # row 1
4:  2   200   200       # row 5
5:  2   200   200       # row 4 (or 5)
6:  2   220   200       # row 4 (or 5)

So it is like using:

x[, cheapest_by_id := min(price), id]

but with the current row removed from each calculation.

If I had a variable that referred to the current row within the group, say .row_nb, I would use:

x[, min(price[-.row_nb]), id]

but no such variable seems to exist…?

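We group by 'id' and use combn on the sequence of rows, setting the number of elements to choose, 'm', to one less than the number of rows (.N-1). The output of combn is used as a numeric index to subset 'price', we take the min of each subset, and assign (:=) the result as a new column. Because the sequence is reversed (.N:1), the k-th combination is the one that leaves out row k, so the results line up with the rows.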

 x[,  cheapest_in_this_id_omitting_current_row:= 
             combn(.N:1, .N-1, FUN=function(i) min(price[i])), by = id]
x
#   id price cheapest_in_this_id_omitting_current_row
#1:  1   100                                      110
#2:  1   110                                      100
#3:  1   120                                      100
#4:  2   200                                      200
#5:  2   200                                      200
#6:  2   220                                      200
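
To see why the reversed sequence matters, here is a small base R sketch on a single group's prices. The vector is copied from the question's id 1 group, and the names idx and loo_min are only for illustration. It suggests that with n:1 as input, the k-th combination is the one that leaves out row k.

```r
# Prices of one group from the question's example (id == 1).
price <- c(100, 110, 120)
n <- length(price)

# Each column is one set of row indices with a single row left out.
# With the reversed input n:1, column k does not contain row k.
idx <- combn(n:1, n - 1)
idx

# Applying min() to each index set gives the leave-one-out minimum per row.
loo_min <- combn(n:1, n - 1, FUN = function(i) min(price[i]))
loo_min
```

With 1:n instead of n:1 the same index sets come out in the opposite order, so the values would be attached to the wrong rows.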

Alternatively, instead of combn, we can loop over the row sequence, use it to index 'price', and take the min. I think this should be fast.

 x[,cheapest_in_this_id_omitting_current_row:=
          unlist(lapply(1:.N, function(i) min(price[-i]))) , id]
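
One case the loop version does not cover is a group with a single row, where price[-i] is empty and min() returns Inf with a warning. A possible sketch, assuming you would rather get NA in that case. Here loo_min is a hypothetical helper, and the extra id 3 row and the column name cheapest_other are made up for illustration:

```r
library(data.table)

# The question's data plus one made-up single-row group (id == 3).
x <- data.table(id    = c(1, 1, 1, 2, 2, 2, 3),
                price = c(100, 110, 120, 200, 200, 220, 50))

# Hypothetical helper: minimum of v with element i left out, for every i.
# Returns NA when there is no other element to compare against.
loo_min <- function(v) {
  if (length(v) < 2L) return(rep(NA_real_, length(v)))
  vapply(seq_along(v), function(i) min(v[-i]), numeric(1))
}

x[, cheapest_other := loo_min(price), by = id]
x
```

vapply is used here only so that the result type is fixed in advance; unlist(lapply(...)) as in the answer would work the same way for this function.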

Here is another way:

x[order(price), min_other_p := c(price[2], rep(price[1], .N-1)), by = id]
# or
x[order(price), min_other_p := replace( rep(price[1], .N), 1, price[2] ), by = id]


   id price min_other_p
1:  1   100         110
2:  1   110         100
3:  1   120         100
4:  2   200         200
5:  2   200         200
6:  2   220         200

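In the OP's example the order(price) in i is unnecessary, because the rows are already sorted by price within each group, but in general it is needed.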
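
A sketch of why the ordering matters when the rows are not already sorted by price within each group. The shuffled table y and the column names unsorted and sorted are made up for illustration:

```r
library(data.table)

# Same prices as in the question, but shuffled within each id.
y <- data.table(id    = c(1, 1, 1, 2, 2, 2),
                price = c(120, 100, 110, 220, 200, 200))

# Without order(price), price[1] and price[2] are simply the first two rows
# of the group, not its two lowest prices, so the result is wrong.
y[, unsorted := c(price[2], rep(price[1], .N - 1)), by = id]

# With order(price) in i, each group is processed in increasing price order,
# and := writes every value back to the row it was computed for.
y[order(price), sorted := c(price[2], rep(price[1], .N - 1)), by = id]
y
```

Only the sorted column matches what min(price[-i]) would give row by row.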

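How it works. We use order to sort the price vector in increasing order, so that price[1] and price[2] are the two lowest prices observed in each group. We therefore want price[1], the overall lowest price, everywhere except at position 1, where we want the next lowest price, price[2].

More explicitly: suppose we have already sorted, so that i==1 is the row with the lowest price in a group, i==2 the row with the second lowest, and so on. Then price[1] is the first order statistic of the group's price vector and price[2] is the second order statistic. Clearly: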
# pseudocode
min(price[-i]) == price[2] if i==1, since price[2] == min(price[2:.N])
min(price[-i]) == price[1] otherwise, since price[1] belongs to price[-i] and is smallest
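
The two cases above can be checked against a brute-force computation in base R. This is only a sanity check on one made-up sorted vector (the seed and the sample range are arbitrary), not a proof:

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# One group's prices, sorted in increasing order (ties are allowed).
p <- sort(sample(1:50, 8, replace = TRUE))

# Brute force: drop each element in turn and take the minimum of the rest.
brute <- vapply(seq_along(p), function(i) min(p[-i]), numeric(1))

# Shortcut from the explanation: second lowest price for position 1,
# lowest price everywhere else.
shortcut <- c(p[2], rep(p[1], length(p) - 1))

stopifnot(all(brute == shortcut))
```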

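@akrun I am trying it now. There is just a lot of data, so it takes a while. It does work on this small example. I am just having a hard time figuring out how it works. Thanks for your help.

I tried sapply in place of unlist-lapply, and it seems to work too.

@Frank If the number of elements is the same, sapply converts the result to a matrix. I prefer lapply together with unlist, to avoid any surprises.

This is a pretty deep and surprising answer that improved my understanding of combinations. +1

@akrun Thanks. I have added an explanation. I hope it is clear.

It felt a bit odd to me to "remove the current row" like this, and I had to recheck the OP to see what was meant. I had order statistics in mind all along. I think using just those two values in each group is a bit tricky. This should be fast.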
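
The sapply behaviour mentioned in the comments can be sketched in base R as follows. The price vector is the id 1 group from the question, and range() is only a stand-in for any function that returns more than one value per call:

```r
price <- c(100, 110, 120)

# Each call returns a single number, so sapply() simplifies to a plain vector,
# identical to unlist(lapply(...)).
s1 <- sapply(seq_along(price), function(i) min(price[-i]))
u1 <- unlist(lapply(seq_along(price), function(i) min(price[-i])))
identical(s1, u1)

# If each call returns several values of equal length,
# sapply() builds a matrix instead of a vector...
s2 <- sapply(seq_along(price), function(i) range(price[-i]))
is.matrix(s2)

# ...whereas unlist(lapply(...)) always stays a flat vector.
u2 <- unlist(lapply(seq_along(price), function(i) range(price[-i])))
length(u2)
```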