Finding the intersection of overlapping ranges in two tables using the data.table function foverlaps
I would like to use foverlaps to find the intersecting ranges of two bed files and collapse any rows containing overlapping ranges into a single row. In the example below I have two tables with genomic ranges. The tables are called "bed" files; they have zero-based start coordinates and one-based end positions of features in chromosomes. For example, START=9, STOP=20 is interpreted to span bases 10 through 20, inclusive. These bed files can contain millions of rows. The solution needs to give the same result regardless of the order in which the two files to be intersected are provided.
First table:
> table1
   CHROMOSOME START STOP
1:          1     1   10
2:          1    20   50
3:          1    70  130
4:          X     1   20
5:          Y     5  200
Second table:
> table2
   CHROMOSOME START STOP
1:          1     5   12
2:          1    15   55
3:          1    60   65
4:          1   100  110
5:          1   130  131
6:          X    60   80
7:          Y     1   15
8:          Y    10   50
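I was thinking that the new foverlaps function could be a very fast way to find the intersecting ranges in these two tables and produce a table that looks like this:
Result table:
> resultTable
   CHROMOSOME START STOP
1:          1     5   10
2:          1    20   50
3:          1   100  110
4:          Y     5   50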
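Is that possible, or is there a better way to do it in data.table?
I would also like to first confirm that, within one table, for any given chromosome, the STOP coordinate does not overlap the START coordinate of the next row. For example, CHROMOSOME Y:1-15 and CHROMOSOME Y:10-50 would need to be collapsed to CHROMOSOME Y:1-50 (see rows 7 and 8 of the second table). This should not be the case, but the function should probably check for it. A real-life example of how potential overlaps should be collapsed is below:
   CHROM  START   STOP
1:     1 721281 721619
2:     1 721430 721906
3:     1 721751 722042
Expected output:
   CHROM  START   STOP
1:     1 721281 722042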
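The check mentioned above (does any row start at or before the end of an earlier row on the same chromosome?) could be sketched as follows. This is only an illustration: the helper name `has_internal_overlaps` is hypothetical, the column names follow the real-life example, and ranges are compared as closed intervals, which is an assumption.

```r
library(data.table)

# Hypothetical helper: TRUE if any two rows on the same chromosome overlap.
has_internal_overlaps <- function(bed) {
  tmp <- copy(bed)                       # leave the caller's table untouched
  setorder(tmp, CHROM, START, STOP)
  # Compare each START with the largest STOP seen so far on that chromosome;
  # shift() yields NA for the first row of each group, hence na.rm = TRUE.
  flags <- tmp[, .(overlap = any(START <= shift(cummax(STOP)), na.rm = TRUE)),
               by = CHROM]
  any(flags$overlap)
}

overlapping <- data.table(CHROM = "1",
                          START = c(721281L, 721430L, 721751L),
                          STOP  = c(721619L, 721906L, 722042L))
separate <- data.table(CHROM = c("1", "1", "X"),
                       START = c(1L, 20L, 1L),
                       STOP  = c(10L, 50L, 20L))

has_internal_overlaps(overlapping)  # TRUE
has_internal_overlaps(separate)     # FALSE
```

Using the running maximum of STOP rather than only the previous row's STOP also catches a long range that swallows several later rows.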
The code to create the example tables is as follows:
table1 <- data.table(
CHROMOSOME = as.character(c("1","1","1","X","Y")) ,
START = c(1,20,70,1,5) ,
STOP = c(10,50,130,20,200)
)
table2 <- data.table(
CHROMOSOME = as.character(c("1","1","1","1","1","X","Y","Y")) ,
START = c(5,15,60,100,130,60,1,10) ,
STOP = c(12,55,65,110,131,80,15,50)
)
foverlaps() will do nicely.
First set the keys for both of the tables:
setkey(table1, CHROMOSOME, START, STOP)
setkey(table2, CHROMOSOME, START, STOP)
Now join them using foverlaps() with nomatch = 0, so that rows with no overlapping partner in table2 are dropped:
resultTable <- foverlaps(table1, table2, nomatch = 0)
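To make the effect of nomatch concrete, here is a small sketch on the example tables from the question. It assumes a data.table version in which foverlaps() accepts both nomatch = NA and nomatch = 0L; the object names `kept` and `dropped` are purely illustrative.

```r
library(data.table)

table1 <- data.table(CHROMOSOME = c("1","1","1","X","Y"),
                     START = c(1,20,70,1,5),
                     STOP  = c(10,50,130,20,200))
table2 <- data.table(CHROMOSOME = c("1","1","1","1","1","X","Y","Y"),
                     START = c(5,15,60,100,130,60,1,10),
                     STOP  = c(12,55,65,110,131,80,15,50))
setkey(table1, CHROMOSOME, START, STOP)
setkey(table2, CHROMOSOME, START, STOP)

# One output row per overlapping pair; table2's coordinates appear as
# START/STOP and table1's as i.START/i.STOP.
kept    <- foverlaps(table1, table2, nomatch = NA)  # rows of table1 without a partner are kept, filled with NA
dropped <- foverlaps(table1, table2, nomatch = 0L)  # rows of table1 without a partner are removed

nrow(kept)     # 7: six overlapping pairs plus the unmatched X row
nrow(dropped)  # 6
```

Note that the join alone returns pairs of overlapping rows, not the intersecting segments themselves.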
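The overlap of a STOP with a subsequent START should be a separate question. Actually I have one of those myself, so perhaps I'll ask it and come back here when I have a good answer.
If you're not stuck on a data.table solution, GenomicRanges gives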
> library(GenomicRanges)
> intersect(makeGRangesFromDataFrame(table1), makeGRangesFromDataFrame(table2))
GRanges object with 5 ranges and 0 metadata columns:
      seqnames     ranges strand
  [1]        1 [  5,  10]      *
  [2]        1 [ 20,  50]      *
  [3]        1 [100, 110]      *
  [4]        1 [130, 130]      *
  [5]        Y [  5,  50]      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
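Row [4] above, [130, 130], appears because both inputs were treated as 1-based closed ranges. Under the zero-based START convention described in the question, START=130, STOP=131 covers only base 131, which a range ending at base 130 does not reach, and the expected result table in the question has no such row. One way to account for this in data.table, sketched here under the assumption that shifting START by one before the join and back afterwards is acceptable (the helper name `to_closed` is hypothetical):

```r
library(data.table)

table1 <- data.table(CHROMOSOME = c("1","1","1","X","Y"),
                     START = c(1,20,70,1,5),
                     STOP  = c(10,50,130,20,200))
table2 <- data.table(CHROMOSOME = c("1","1","1","1","1","X","Y","Y"),
                     START = c(5,15,60,100,130,60,1,10),
                     STOP  = c(12,55,65,110,131,80,15,50))

# Hypothetical helper: zero-based START -> one-based START, on a copy.
to_closed <- function(bed) copy(bed)[, START := START + 1][]

a <- to_closed(table1)
b <- to_closed(table2)
setkey(b, CHROMOSOME, START, STOP)

ov <- foverlaps(a, b, by.x = c("CHROMOSOME", "START", "STOP"), nomatch = 0L)
# Clip each pair to its intersection and shift START back to zero-based.
res <- ov[, .(CHROMOSOME,
              START = pmax(START, i.START) - 1,
              STOP  = pmin(STOP, i.STOP))]
res
```

With this shift the pair 70-130 / 130-131 no longer counts as overlapping; the two Y rows (5-15 and 10-50) still come back separately and would need collapsing. For the GenomicRanges route, the `starts.in.df.are.0based` argument of makeGRangesFromDataFrame may serve the same purpose.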
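In most overlapping range problems in genomics, we have one large data set x (usually sequenced reads) and another, smaller data set y (usually the gene model, exons, introns, etc.). We are tasked with finding which intervals in x overlap with which intervals in y, or how many intervals in x overlap each y interval.
In foverlaps(), we don't have to setkey() on the larger data set x, which is quite an expensive operation. But y needs to have its key set. For your case, from this example, it seems that table2 is the larger one (= x) and table1 = y.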
require(data.table)
setkey(table1) # key columns = chr, start, end
ans = foverlaps(table2, table1, type="any", nomatch=0L)
ans[, `:=`(i.START = pmax(START, i.START),
i.STOP = pmin(STOP, i.STOP))]
ans = ans[, .(i.START[1L], i.STOP[.N]), by=.(CHROMOSOME, START, STOP)]
# CHROMOSOME START STOP V1 V2
# 1: 1 1 10 5 10
# 2: 1 20 50 20 50
# 3: 1 70 130 100 130
# 4: Y 5 200 5 50
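But I agree it would be great to be able to do this in one step. Not sure how yet, but perhaps by using additional values, reduce and intersect, for the mult= argument.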
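Until something like that exists, an order-independent result can be sketched in two explicit steps: take the pairwise intersections from foverlaps(), then merge any intersections that still overlap one another. The helper name `intersect_collapse` is hypothetical, ranges are treated as closed intervals (as in the code above), and this is a sketch rather than a tuned implementation.

```r
library(data.table)

# Hypothetical helper: intersect two range tables, then collapse the result.
intersect_collapse <- function(a, b) {
  cols <- c("CHROMOSOME", "START", "STOP")
  x <- a[, cols, with = FALSE]          # new tables, so the inputs are not re-keyed
  y <- b[, cols, with = FALSE]
  setkeyv(y, cols)
  ov <- foverlaps(x, y, by.x = cols, type = "any", nomatch = 0L)
  # Intersection of each overlapping pair.
  res <- ov[, .(CHROMOSOME,
                START = pmax(START, i.START),
                STOP  = pmin(STOP, i.STOP))]
  setorder(res, CHROMOSOME, START, STOP)
  # Start a new group whenever START lies beyond every STOP seen so far.
  res[, grp := {
    prev_max <- shift(cummax(STOP))
    cumsum(is.na(prev_max) | START > prev_max)
  }, by = CHROMOSOME]
  res[, .(START = min(START), STOP = max(STOP)), by = .(CHROMOSOME, grp)][, grp := NULL][]
}

table1 <- data.table(CHROMOSOME = c("1","1","1","X","Y"),
                     START = c(1,20,70,1,5),
                     STOP  = c(10,50,130,20,200))
table2 <- data.table(CHROMOSOME = c("1","1","1","1","1","X","Y","Y"),
                     START = c(5,15,60,100,130,60,1,10),
                     STOP  = c(12,55,65,110,131,80,15,50))

r12 <- intersect_collapse(table1, table2)
r21 <- intersect_collapse(table2, table1)
all.equal(r12, r21)  # TRUE: the argument order does not matter
```

Because the set of pairwise intersections is the same whichever table plays the role of x, sorting and collapsing it gives the same table either way. On the example data the Y rows merge into 5-50, and the 130-130 row is retained because the ranges are compared as closed intervals.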
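@Seth provided the fastest way to solve the intersection overlaps problem using the data.table foverlaps function. However, that solution did not take into account the fact that the input bed files may contain overlapping ranges that need to be reduced into single regions. @Martin Morgan solved that with his solution using the GenomicRanges package, which did both the intersecting and the range reducing. However, Martin's solution didn't use the foverlaps function. @Arun pointed out that merging overlapping ranges in different rows within a table is not currently possible using foverlaps. Thanks to the answers provided, and some additional research on stackoverflow, I came up with this hybrid solution.
Create example bed files without overlapping regions within each file: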
chr <- c(1:22,"X","Y","MT")
#bedA contains 5 million rows
bedA <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,200000))),
START = rep(as.integer(seq(1,200000000,1000)),25),
STOP = rep(as.integer(seq(500,200000000,1000)),25),
key = c("CHROM","START","STOP")
)
#bedB contains 500 thousand rows
bedB <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,20000))),
START = rep(as.integer(seq(200,200000000,10000)),25),
STOP = rep(as.integer(seq(600,200000000,10000)),25),
key = c("CHROM","START","STOP")
)
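Here is a solution entirely in data.table, based on Pete's answer. It is actually slower than his solution that uses GenomicRanges and data.table, but still faster than the solution that uses only GenomicRanges.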
intersectBedFiles.foverlaps2 <- function(bed1,bed2) {
  require(data.table)
  bedKey <- c("CHROM","START","STOP")
  # foverlaps() needs its second argument keyed; use the smaller table there
  if(nrow(bed1)>nrow(bed2)) {
    if(!identical(key(bed2),bedKey)) setkeyv(bed2,bedKey)
    bed <- foverlaps(bed1, bed2, nomatch = 0)
  } else {
    if(!identical(key(bed1),bedKey)) setkeyv(bed1,bedKey)
    bed <- foverlaps(bed2, bed1, nomatch = 0)
  }
  bed[, row_id := seq_len(nrow(bed))]
  # clip each overlapping pair to its intersection
  bed[, START := pmax(START, i.START)]
  bed[, STOP := pmin(STOP, i.STOP)]
  bed[, `:=`(i.START = NULL, i.STOP = NULL)]
  setkeyv(bed,bedKey)
  # self-join to find intersections that overlap one another, then widen them
  temp <- foverlaps(bed,bed)
  temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=row_id]
  temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=i.row_id]
  out <- unique(temp[,.(CHROM,START,STOP)])
  setkeyv(out,bedKey)
  out
}
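Requires an additional aggregation step to get the intersecting ranges.
@Arun yep, the aggregation step is needed. If you come up with an answer, let me know. I updated the question to specify that overlapping ranges need to be merged.
@Arun the actual data set has about 4 million rows in table 1 and about 230 thousand rows in table 2. I'm not familiar with the use of "." in front of a list (e.g. ".(CHROMOSOME, START, STOP)"); what is it for? The function does seem to aggregate the overlapping regions on chromosome Y. However, if I swap table1 and table2, I get a different answer. I would like it to work either way, since either table could be the longer one.
@Pete, foverlaps() is designed to find/merge by overlapping ranges. While it is possible to get what you require using foverlaps(), it isn't straightforward. We'll have to figure out how best to do this: either provide functionality in foverlaps() to obtain intersecting ranges in the case of multiple overlaps, or provide another function as GenomicRanges does. I haven't thought much about it, and I won't be able to work on it in the short term.
@Pete, being similarly curious, I searched the data.table package and stumbled upon "." (it is an alias for list()). That might be worth a line or two, Arun...
@Arun is there any update on a way to reduce intervals natively in data.table? I came up with a solution, but it doesn't seem that efficient (see the answer below).
So this intersection is exactly one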
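On the "." asked about in the comments: inside the brackets of a data.table, .() is shorthand for list(). A minimal sketch with made-up data (the table `dt` is illustrative only):

```r
library(data.table)

dt <- data.table(CHROMOSOME = c("1", "1", "Y"),
                 START = c(1, 20, 5),
                 STOP  = c(10, 50, 200))

# The two calls are equivalent; .() is just an alias for list() in j and by.
a <- dt[, .(n = .N, first_start = START[1L]), by = .(CHROMOSOME)]
b <- dt[, list(n = .N, first_start = START[1L]), by = list(CHROMOSOME)]

identical(a, b)  # TRUE
```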
Timings for the hybrid solution on the example bed files (the functions are defined further down):
#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))
user system elapsed
1.25 0.02 1.37
#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))
user system elapsed
12.95 0.06 13.04
identical(tmpA,tmpB)
[1] TRUE
#Create overlapping ranges
makeOverlaps <- as.integer(c(0,0,600,0,0,0,600,0,0,0))
#copy() so that bedA and bedB are not modified by reference
bedC <- copy(bedA)[, STOP := STOP + makeOverlaps, by=CHROM]
bedD <- copy(bedB)[, STOP := STOP + makeOverlaps, by=CHROM]
#This solution uses foverlaps to find the intersection and then run GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))
user system elapsed
1.83 0.05 1.89
#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))
user system elapsed
12.95 0.04 12.99
identical(tmpC,tmpD)
[1] TRUE
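intersectBedFiles.foverlaps <- function(bed1,bed2) {
  require(data.table)
  bedKey <- c("CHROM","START","STOP")
  # foverlaps() requires its second argument to be keyed; use the smaller table there
  if(nrow(bed1)>nrow(bed2)) {
    if(!identical(key(bed2),bedKey)) setkeyv(bed2,bedKey)
    bed <- foverlaps(bed1, bed2, nomatch = 0)
  } else {
    if(!identical(key(bed1),bedKey)) setkeyv(bed1,bedKey)
    bed <- foverlaps(bed2, bed1, nomatch = 0)
  }
  # clip each overlapping pair to its intersection
  bed[, START := pmax(START, i.START)]
  bed[, STOP := pmin(STOP, i.STOP)]
  bed[, `:=`(i.START = NULL, i.STOP = NULL)]
  if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
  # reduce only if some range touches or overlaps the next one on the same chromosome
  if(any(bed[, STOP+1 >= rowShift(START), by=CHROM][,V1], na.rm = TRUE)) {
    bed <- reduceBed.GenomicRanges(bed)
  }
  return(bed)
}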
rowShift <- function(x, shiftLen = 1L) {
#Note this function was described in this thread:
#http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
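As an aside, data.table versions that provide shift() offer the same lead/lag behaviour as rowShift() above. The comparison below is a small sketch on made-up integers (the vector `starts` is illustrative):

```r
library(data.table)

# rowShift() as defined above, repeated so the snippet runs on its own.
rowShift <- function(x, shiftLen = 1L) {
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r < 1] <- NA
  x[r]
}

starts <- c(1L, 1001L, 2001L, 3001L)

# Positive shiftLen looks ahead (lead); negative shiftLen looks back (lag).
lead_same <- identical(rowShift(starts),      shift(starts, 1L, type = "lead"))
lag_same  <- identical(rowShift(starts, -1L), shift(starts, 1L, type = "lag"))
c(lead_same, lag_same)  # TRUE TRUE
```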
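reduceBed.GenomicRanges <- function(bed) {
  require(data.table)
  require(GenomicRanges)
  # defined locally so the function does not depend on a bedKey in the calling environment
  bedKey <- c("CHROM","START","STOP")
  setnames(bed,colnames(bed),bedKey)
  if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
  grBed <- makeGRangesFromDataFrame(bed,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  # merge overlapping and adjacent ranges
  grBed <- reduce(grBed)
  grBed <- data.table(
    CHROM=as.character(seqnames(grBed)),
    START=start(grBed),
    STOP=end(grBed),
    key = c("CHROM","START","STOP"))
  return(grBed)
}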
intersectBedFiles.GR <- function(bed1,bed2) {
require(data.table)
require(GenomicRanges)
bed1 <- makeGRangesFromDataFrame(bed1,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
bed2 <- makeGRangesFromDataFrame(bed2,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grMerge <- suppressWarnings(intersect(bed1,bed2))
resultTable <- data.table(
CHROM=as.character(seqnames(grMerge)),
START=start(grMerge),
STOP=end(grMerge),
key = c("CHROM","START","STOP"))
return(resultTable)
}
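reduceBed.IRanges <- function(bed) {
  require(data.table)
  require(IRanges)
  # defined locally so the function does not depend on a bedKey in the calling environment
  bedKey <- c("CHROM","START","STOP")
  # note: bed.tmp and bed are the same object here, so the group column is added to bed by reference
  bed.tmp <- bed
  bed.tmp[,group := {
    ir <- IRanges(START, STOP);
    subjectHits(findOverlaps(ir, reduce(ir)))
  }, by=CHROM]
  bed.tmp <- bed.tmp[, list(CHROM=unique(CHROM),
    START=min(START),
    STOP=max(STOP)),
    by=list(group,CHROM)]
  setkeyv(bed.tmp,bedKey)
  # remove the helper column that was added to the input by reference
  bed[,group := NULL]
  # drop the two grouping columns (group and the duplicated CHROM)
  return(bed.tmp[, -(1:2)])
}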
system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))
user system elapsed
10.86 0.01 10.89
system.time(bedD.reduced <- reduceBed.IRanges(bedC))
user system elapsed
137.12 0.14 137.58
identical(bedC.reduced,bedD.reduced)
[1] TRUE