Find all overlaps in one iteration of foverlaps in R's data.table

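I am trying to merge a set of overlapping time periods in R using data.table. I have a call to foverlaps() that joins the table to itself, and that part is efficient enough.

My problem is this: suppose period A overlaps period B, and period B overlaps period C, but A does not overlap C. In that case A is not grouped with C, even though all three eventually have to be merged.

Currently I have a while loop that finds overlaps and merges them until no more merges occur, but this does not scale well. One solution I can see is to apply the group index to itself repeatedly until it stabilises, but that still seems to need a loop, and I would like a fully vectorised solution.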

dt = data.table(start = c(1,2,4,6,8,10),
                end   = c(2,3,6,8,10,12))
setkeyv(dt,c("start","end"))

f = foverlaps(dt,
              dt,
              type="any",
              mult="first",
              which=TRUE)

#Needs to return [1,1,3,3,3,3]
print(f)
#1 1 3 3 4 5
print(f[f])
#1 1 3 3 3 4
print(f[f][f])
#1 1 3 3 3 3

Can anyone help me find a way to vectorise this procedure?
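
The repeated indexing shown above (f, f[f], f[f][f]) can be sketched as a loop that stops once the index no longer changes. This is only an illustration of the idea described in the question, not a vectorised solution; the helper name collapse_index is hypothetical, and the input vector is the foverlaps() result quoted above. The loop terminates because, with mult="first" on a self-join, every row points at itself or at an earlier row.

```r
# Hypothetical helper: index f by itself until it reaches a fixed point.
# Each pass composes the mapping with itself, so chains of overlaps
# collapse onto the first row of their group.
collapse_index <- function(f) {
  repeat {
    g <- f[f]
    if (identical(g, f)) return(f)
    f <- g
  }
}

# Index vector as printed by print(f) in the question
f <- c(1L, 1L, 3L, 3L, 4L, 5L)
res <- collapse_index(f)
print(res)
# 1 1 3 3 3 3
```

Because each pass squares the mapping, the number of passes grows with the logarithm of the longest overlap chain, but it is still a loop, which is exactly what the question hopes to avoid.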

Edit, using IDs:

dt = data.table(id = c('A','A','A','A','A','B','B','B'),
                eventStart = c(1,2,4,6,8,10,11,15),
                eventEnd   = c(2,3,6,8,10,12,14,16))
setkeyv(dt,c("id","eventStart","eventEnd"))

f = foverlaps(dt,
              dt,
              type="any",
              mult="first",
              which=TRUE)

#Needs to return [1 1 3 3 3 6 6 8] or similar

The IRanges package on Bioconductor, which inspired data.table's foverlaps(), has some convenient functions for this kind of problem.

Perhaps reduce() is the function you are looking for to merge all overlapping periods:

library(data.table)
dt = data.table(start = c(1,2,4,6,8,10),
                end   = c(2,3,6,8,10,12))

library(IRanges)
ir <- IRanges(dt$start, dt$end)

ir

reduce(ir, min.gapwidth = 0L)

IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1         3         3
  [2]         4        12         9

as.data.table(reduce(ir, min.gapwidth = 0L))

   start end width
1:     1   3     3
2:     4  12     9

Comprehensive documentation for IRanges is available on Bioconductor.

Edit: The OP has provided a second sample data set that includes an id column and has asked whether IRanges supports joining intervals by id.

Adding data to IRanges seems to lead quickly into the field of genome research, which is unfamiliar territory for me. However, I found the following approaches using IRanges:

Grouping with IRanges

library(data.table)
# 2nd sample data set provided by the OP
dt = data.table(id = c('A','A','A','A','A','B','B','B'),
                eventStart = c(1,2,4,6,8,10,11,15),
                eventEnd   = c(2,3,6,8,10,12,14,16))

library(IRanges)
# set names when constructing IRanges object
ir <- IRanges(dt$eventStart, dt$eventEnd, names = dt$id)

lapply(split(ir, names(ir)), reduce, min.gapwidth = 0L)

$A
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1         3         3
  [2]         4        10         7

$B
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]        10        14         5
  [2]        15        16         2

# combine the per-id results into one data.table with an id column
ir <- IRanges(dt$eventStart, dt$eventEnd, names = dt$id)
rbindlist(lapply(split(ir, names(ir)), 
                 function(x) as.data.table(reduce(x, min.gapwidth = 0L))), 
          idcol = "id")

   id start end width
1:  A     1   3     3
2:  A     4  10     7
3:  B    10  14     5
4:  B    15  16     2

Grouping within data.table

If we group within data.table and apply reduce() to the individual chunks, we get the same result with less complicated code:

dt[, as.data.table(reduce(IRanges(eventStart, eventEnd), min.gapwidth = 0L)), id]

Comments:

Related, though rather complicated.

"Needs to return [1,1,3,3,3]": shouldn't that have length 6? Anyway, if your intervals are structured like this, with start >= lag(end), then dt[, cumsum(start - shift(end, fill=0) > 0)] seems to work.

@Frank Thanks, I had accidentally copied over an earlier version.

@Frank Thank you, that is a very nice solution, and really short! Unfortunately my data set is slightly more perverse than the one in the example: some intervals are entirely contained within others, and in those cases this seems to break down :/

Uwe, this is very useful. Do you know whether IRanges supports joining intervals by ID (see edit)?
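
The comments point out that the short cumsum() suggestion breaks when one interval lies entirely inside another. A common variation, which does not come from this thread and is offered only as a sketch, compares each start with the running maximum of all earlier ends instead of with the previous end alone. The sample data, the grp column, and the merged table below are hypothetical names chosen for illustration; the sketch assumes closed intervals where touching end points count as overlapping, as with type="any" in the question.

```r
library(data.table)

# Hypothetical sample data: within id "A", the intervals [2,3] and [6,7]
# lie entirely inside [1,8].
dt <- data.table(id    = c("A", "A", "A", "A", "B", "B"),
                 start = c(1, 2, 6, 10, 1, 3),
                 end   = c(8, 3, 7, 12, 2, 4))

# Rows must be ordered by start within each id for the running maximum to work.
setorder(dt, id, start, end)

# A row opens a new group only if its start lies strictly after every
# earlier end within the same id.
dt[, grp := cumsum(start > shift(cummax(end), fill = -Inf)), by = id]

# Collapse each group to a single merged interval.
merged <- dt[, .(start = min(start), end = max(end)), by = .(id, grp)]
print(merged)
```

On this data the rows of id "A" fall into groups 1, 1, 1, 2 and those of id "B" into groups 1, 2, so the contained intervals stay attached to the interval that covers them. No loop is involved, but unlike reduce() from IRanges the sketch depends on the rows being sorted first.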