
Finding the intersection of overlapping ranges in two tables using the data.table function foverlaps


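I would like to use foverlaps to find the intersecting ranges of two BED files, and to collapse any rows containing overlapping ranges into a single row. In the example below I have two tables with genomic ranges. The tables are "BED" files, in which the start coordinate of a feature on a chromosome is zero-based and the end position is one-based. For example, START=9, STOP=20 is interpreted as spanning bases 10 through 20, inclusive. These BED files can contain millions of rows. The solution needs to give the same result regardless of the order in which the two files are supplied.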
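To make the coordinate convention concrete, here is a tiny sketch of the START=9, STOP=20 example from the paragraph above. The variable names are made up for illustration and are not part of the question's data.

```r
# BED-style interval from the example: zero-based START, one-based STOP
start0 <- 9L
stop1  <- 20L

first_base <- start0 + 1L     # first base covered (one-based)
last_base  <- stop1           # last base covered (one-based, inclusive)
n_bases    <- stop1 - start0  # number of bases spanned

print(c(first = first_base, last = last_base, n = n_bases))
```

Under this convention the interval length is simply STOP - START, with no "+ 1" correction.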

First table:

> table1
   CHROMOSOME START STOP
1:          1     1   10
2:          1    20   50
3:          1    70  130
4:          X     1   20
5:          Y     5  200
Second table:

> table2
   CHROMOSOME START STOP
1:          1     5   12
2:          1    15   55
3:          1    60   65
4:          1   100  110
5:          1   130  131
6:          X    60   80
7:          Y     1   15
8:          Y    10   50
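I thought the new foverlaps function could be a very fast way to find the intersecting ranges in these two tables, producing a table that looks like this:

Result table: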

> resultTable
   CHROMOSOME START STOP
1:          1     5   10
2:          1    20   50
3:          1   100  110
4:          Y     5   50  
Is that possible, or is there a better way to do it in data.table?

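I would also like to confirm first that, within one table, the STOP coordinate of a row does not overlap the START coordinate of the next row on the same chromosome. For example, chromosome Y:1-15 and Y:10-50 would need to be collapsed to chromosome Y:1-50 (see rows 7 and 8 of the second table). This should not be the case, but the function should probably check for it. A real-life example of how potential overlaps should be collapsed is below: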

   CHROM  START   STOP
1:     1 721281 721619
2:     1 721430 721906
3:     1 721751 722042
Desired output:

   CHROM  START   STOP
1:     1 721281 722042
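The collapsing step shown above can be sketched in plain data.table as follows. This is only one possible approach, not something taken from the answers; `reduce_ranges` and `grp` are hypothetical names, and the sketch assumes closed intervals, so ranges that merely touch (STOP + 1 == next START) are left separate.

```r
library(data.table)

# Hypothetical helper (not part of the original post): merge overlapping
# closed intervals within each chromosome.
reduce_ranges <- function(dt) {
  x <- copy(dt)                      # avoid modifying the caller's table
  setorder(x, CHROM, START, STOP)
  # A new group starts when START lies beyond every STOP seen so far
  x[, grp := cumsum(START > c(-Inf, cummax(STOP)[-.N])), by = CHROM]
  out <- x[, .(START = min(START), STOP = max(STOP)), by = .(CHROM, grp)]
  out[, grp := NULL]
  out[]
}

bed <- data.table(CHROM = "1",
                  START = c(721281L, 721430L, 721751L),
                  STOP  = c(721619L, 721906L, 722042L))
reduced <- reduce_ranges(bed)
print(reduced)
```

The running maximum of STOP matters because a long early interval can swallow several later ones; comparing only against the previous row's STOP would split such a group.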
The code to create the example tables is as follows:

table1 <- data.table(
   CHROMOSOME = as.character(c("1","1","1","X","Y")) ,
   START = c(1,20,70,1,5) ,
   STOP = c(10,50,130,20,200)
)

table2 <- data.table(
   CHROMOSOME = as.character(c("1","1","1","1","1","X","Y","Y")) ,
   START = c(5,15,60,100,130,60,1,10) ,
   STOP = c(12,55,65,110,131,80,15,50)
 )
foverlaps() will do nicely.

First, set the keys for both tables:

setkey(table1, CHROMOSOME, START, STOP)
setkey(table2, CHROMOSOME, START, STOP)
Now join them using foverlaps() with nomatch=0 to drop the unmatched rows in table2:

resultTable <- foverlaps(table1, table2, nomatch = 0)

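Collapsing overlaps between a STOP and a later START should be a separate question. It is actually one that I have myself, so maybe I'll ask it and come back here when I have a good answer.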

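If you are not tied to a data.table solution, GenomicRanges gives:

> library(GenomicRanges)
> intersect(makeGRangesFromDataFrame(table1), makeGRangesFromDataFrame(table2))
GRanges object with 5 ranges and 0 metadata columns:
      seqnames     ranges strand
  [1]        1 [  5,  10]      *
  [2]        1 [ 20,  50]      *
  [3]        1 [100, 110]      *
  [4]        1 [130, 130]      *
  [5]        Y [  5,  50]      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths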

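In most overlapping-range problems in genomics, we have one large data set x (usually sequenced reads) and another, smaller data set y (usually gene models, exons, introns and so on). The task is to find which intervals in x overlap which intervals in y, or how many intervals in x overlap each interval in y.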
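As an illustration of the second task (counting how many x intervals hit each y interval), here is a minimal sketch using the `which = TRUE` argument of foverlaps(). The small tables are chromosome 1 from the question's example, with table2 playing the role of x and table1 the role of y; the names `hits` and `counts` are only illustrative.

```r
library(data.table)

# x: the "large" table (chromosome 1 rows of table2 in the question)
x <- data.table(CHROMOSOME = "1",
                START = c(5L, 15L, 60L, 100L, 130L),
                STOP  = c(12L, 55L, 65L, 110L, 131L))
# y: the "small" table (chromosome 1 rows of table1); only y needs a key
y <- data.table(CHROMOSOME = "1",
                START = c(1L, 20L, 70L),
                STOP  = c(10L, 50L, 130L))
setkey(y, CHROMOSOME, START, STOP)

# which = TRUE returns row indices (xid, yid) instead of the joined rows
hits <- foverlaps(x, y, by.x = c("CHROMOSOME", "START", "STOP"),
                  type = "any", which = TRUE, nomatch = 0L)

# Number of x intervals overlapping each y interval
counts <- hits[, .N, by = yid]
print(counts)
```

Here yid indexes the rows of the keyed y, so it can be used to look the intervals back up with `y[counts$yid]`.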

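With foverlaps(), we do not have to call setkey() on the larger data set x, which is quite an expensive operation. But y does need to have its key set. For your case, judging from this example, table2 seems to be the larger one (= x) and table1 = y.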

require(data.table)
setkey(table1) # key columns = chr, start, end
ans = foverlaps(table2, table1, type="any", nomatch=0L)
ans[, `:=`(i.START = pmax(START, i.START), 
           i.STOP = pmin(STOP, i.STOP))]

ans = ans[, .(i.START[1L], i.STOP[.N]), by=.(CHROMOSOME, START, STOP)]
#    CHROMOSOME START STOP  V1  V2
# 1:          1     1   10   5  10
# 2:          1    20   50  20  50
# 3:          1    70  130 100 130
# 4:          Y     5  200   5  50


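But I agree that it would be great to be able to do this in one step. I am not sure how yet, but possibly by allowing the additional values reduce and intersect for the mult= argument.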
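Until something like that exists, one way to avoid the aggregation step is to keep every pairwise intersection as its own row. The sketch below is an assumption-laden illustration rather than the answerer's method: `pairwise_intersect` is a hypothetical helper, and it does not merge overlapping results (on chromosome Y it returns 5-15 and 10-50 separately). It does give the same rows whichever table is passed first.

```r
library(data.table)

table1 <- data.table(CHROMOSOME = c("1","1","1","X","Y"),
                     START = c(1,20,70,1,5),
                     STOP  = c(10,50,130,20,200))
table2 <- data.table(CHROMOSOME = c("1","1","1","1","1","X","Y","Y"),
                     START = c(5,15,60,100,130,60,1,10),
                     STOP  = c(12,55,65,110,131,80,15,50))

# Hypothetical helper: one output row per overlapping pair of intervals
pairwise_intersect <- function(a, b) {
  cols <- c("CHROMOSOME", "START", "STOP")
  b <- copy(b)                       # key a copy, leave the input untouched
  setkeyv(b, cols)
  ov <- foverlaps(a, b, by.x = cols, type = "any", nomatch = 0L)
  # Columns from a carry the "i." prefix; clip each pair to its overlap
  out <- ov[, .(CHROMOSOME,
                START = pmax(START, i.START),
                STOP  = pmin(STOP, i.STOP))]
  setkeyv(out, cols)
  unique(out)
}

r12 <- pairwise_intersect(table1, table2)
r21 <- pairwise_intersect(table2, table1)
print(r12)
```

Merging the remaining overlaps (for example Y:5-15 and Y:10-50 into Y:5-50) still needs a separate reduce step, which is what the later answers address.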

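@Seth provided the fastest way to solve the intersection problem, using the data.table foverlaps function. However, that solution does not take into account the fact that the input BED files may themselves contain overlapping ranges that need to be reduced to single regions. @Martin Morgan addressed this with his solution using the GenomicRanges package, which does both the intersecting and the range reducing. However, Martin's solution does not use the foverlaps function. @Arun pointed out that collapsing overlapping ranges across different rows within one table is not currently possible with foverlaps. Thanks to the answers provided, and some additional research on Stack Overflow, I came up with this hybrid solution.

Create example BED files with no overlapping regions within each file: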

chr <- c(1:22,"X","Y","MT")

#bedA contains 5 million rows
bedA <- data.table(
    CHROM = as.vector(sapply(chr, function(x) rep(x,200000))),
    START = rep(as.integer(seq(1,200000000,1000)),25),
    STOP = rep(as.integer(seq(500,200000000,1000)),25),
    key = c("CHROM","START","STOP")
    )

#bedB contains 500 thousand rows
bedB <- data.table(
  CHROM = as.vector(sapply(chr, function(x) rep(x,20000))),
  START = rep(as.integer(seq(200,200000000,10000)),25),
  STOP = rep(as.integer(seq(600,200000000,10000)),25),
  key = c("CHROM","START","STOP")
)

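Here is a solution entirely in data.table, based on Pete's answer. It is actually slower than his solution that combines GenomicRanges and data.table, but still faster than the solution that uses only GenomicRanges.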

intersectBedFiles.foverlaps2 <- function(bed1,bed2) {
  require(data.table)
  bedKey <- c("CHROM","START","STOP")
  if(nrow(bed1)>nrow(bed2)) {
    if(!identical(key(bed2),bedKey)) setkeyv(bed2,bedKey)
    bed <- foverlaps(bed1, bed2, nomatch = 0)
  } else {
    if(!identical(key(bed1),bedKey)) setkeyv(bed1,bedKey)
    bed <- foverlaps(bed2, bed1, nomatch = 0)
  }
  bed[,row_id:=1:nrow(bed)]
  bed[, START := pmax(START, i.START)]
  bed[, STOP := pmin(STOP, i.STOP)]
  bed[, `:=`(i.START = NULL, i.STOP = NULL)]

  setkeyv(bed,bedKey)
  temp <- foverlaps(bed,bed)

  temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=row_id]
  temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=i.row_id]
  out <- unique(temp[,.(CHROM,START,STOP)])
  setkeyv(out,bedKey)
  out
}

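This requires an additional aggregation step to get the intersecting ranges.

@Arun yep, it needs the aggregation step. Let me know if you get an answer.

I updated the question to state the requirement that overlapping ranges be merged.

@Arun the actual data sets have about 4 million rows in table 1 and about 230 thousand rows in table 2.

I'm not familiar with the use of the "." in front of the list (e.g. ".(CHROMOSOME, START, STOP)"). What is its purpose? The function does seem to aggregate the overlapping regions on chromosome Y. However, if I swap table1 and table2, I get a different answer. I'd like it to work either way, since either table could be the longer one.

@Pete, foverlaps() is designed to find/merge by overlapping ranges. While it is possible to get what you require with foverlaps(), it is not straightforward. We will have to figure out how best to do this: either provide a feature in foverlaps() to obtain intersecting ranges when there are multiple overlaps, or provide another function, as GenomicRanges does. I haven't thought about it much, and I won't be able to work on it in the short term.

@Pete being curious as well, I searched the data.table package and stumbled upon "." (it is an alias for list()). This may be worth a line or two, Arun...

@Arun is there any update on a way to reduce intervals natively in data.table? I came up with a solution, but it does not seem that efficient (see the answer below).

So this intersection is exactly a ...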

Intersect bedA and bedB (as created above) with both approaches and compare the timings:

#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))

user  system elapsed 
1.25    0.02    1.37 

#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))

user  system elapsed 
12.95    0.06   13.04 

identical(tmpA,tmpB)
[1] TRUE
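
#Create overlapping ranges
#copy() keeps bedA and bedB from being modified by reference, and
#rep_len() recycles makeOverlaps explicitly to the size of each group
makeOverlaps <-  as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC <- copy(bedA)[, STOP := STOP + rep_len(makeOverlaps, .N), by=CHROM]
bedD <- copy(bedB)[, STOP := STOP + rep_len(makeOverlaps, .N), by=CHROM]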
#This solution uses foverlaps to find the intersection and then run GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))

user  system elapsed 
1.83    0.05    1.89 

#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))

user  system elapsed 
12.95    0.04   12.99 

identical(tmpC,tmpD)
[1] TRUE
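
intersectBedFiles.foverlaps <- function(bed1,bed2) {
  require(data.table)
  bedKey <- c("CHROM","START","STOP")
  #foverlaps() needs its second argument (the smaller table) to be keyed
  if(nrow(bed1)>nrow(bed2)) {
    if(!identical(key(bed2),bedKey)) setkeyv(bed2,bedKey)
    bed <- foverlaps(bed1, bed2, by.x = bedKey, nomatch = 0)
  } else {
    if(!identical(key(bed1),bedKey)) setkeyv(bed1,bedKey)
    bed <- foverlaps(bed2, bed1, by.x = bedKey, nomatch = 0)
  }
  #Clip each matched pair of ranges to its intersection
  bed[, START := pmax(START, i.START)]
  bed[, STOP := pmin(STOP, i.STOP)]
  bed[, `:=`(i.START = NULL, i.STOP = NULL)]
  if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
  #Reduce only if some range touches or overlaps the next one on its chromosome
  if(any(bed[, STOP+1 >= rowShift(START), by=CHROM][,V1], na.rm = TRUE)) {
    bed <- reduceBed.GenomicRanges(bed)
  }
  return(bed)
}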

rowShift <- function(x, shiftLen = 1L) {
  #Note this function was described in this thread:
  #http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r<1] <- NA
  return(x[r])
}
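
As an aside, data.table (from version 1.9.6 on, as far as I know) ships its own shift() function, which could probably replace the hand-written rowShift helper. The sketch below only checks that the two agree for the default shift of one row; the `starts` vector is made-up example data.

```r
library(data.table)

# rowShift as defined in the answer, repeated so the sketch is self-contained
rowShift <- function(x, shiftLen = 1L) {
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r < 1] <- NA
  return(x[r])
}

starts <- c(1L, 20L, 70L)

lead_manual <- rowShift(starts)              # value from the next row
lead_dt     <- shift(starts, type = "lead")  # data.table equivalent

print(identical(lead_manual, lead_dt))
```

Inside the function, the check would then read `STOP+1 >= shift(START, type = "lead")`, still grouped by CHROM.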

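reduceBed.GenomicRanges <- function(bed) {
  require(data.table)
  require(GenomicRanges)
  #bedKey has to be defined here: the copy inside
  #intersectBedFiles.foverlaps is local to that function
  bedKey <- c("CHROM","START","STOP")
  setnames(bed,colnames(bed),bedKey)
  if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
  grBed <- makeGRangesFromDataFrame(bed,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  #reduce() merges overlapping and adjacent ranges
  grBed <- reduce(grBed)
  grBed <- data.table(
    CHROM=as.character(seqnames(grBed)),
    START=start(grBed),
    STOP=end(grBed),
    key = c("CHROM","START","STOP"))
  return(grBed)
}
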
intersectBedFiles.GR <- function(bed1,bed2) {
  require(data.table)
  require(GenomicRanges)
  bed1 <- makeGRangesFromDataFrame(bed1,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  bed2 <- makeGRangesFromDataFrame(bed2,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  grMerge <- suppressWarnings(intersect(bed1,bed2))
  resultTable <- data.table(
    CHROM=as.character(seqnames(grMerge)),
    START=start(grMerge),
    STOP=end(grMerge),
    key = c("CHROM","START","STOP"))
  return(resultTable)
}
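
reduceBed.IRanges <- function(bed) {
  require(data.table)
  require(IRanges)
  bedKey <- c("CHROM","START","STOP")
  #copy() so that the input table is not modified by reference
  bed.tmp <- copy(bed)
  #Label each row with the index of the reduced range that contains it
  bed.tmp[,group := {
      ir <-  IRanges(START, STOP)
      subjectHits(findOverlaps(ir, reduce(ir)))
    }, by=CHROM]
  bed.tmp <- bed.tmp[, list(START=min(START),
              STOP=max(STOP)),
       by=list(CHROM,group)]
  bed.tmp[,group := NULL]
  setkeyv(bed.tmp,bedKey)
  return(bed.tmp)
}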


system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))

user  system elapsed 
10.86    0.01   10.89 

system.time(bedD.reduced <- reduceBed.IRanges(bedC))

user  system elapsed 
137.12    0.14  137.58 

identical(bedC.reduced,bedD.reduced)
[1] TRUE