BioConductor IRanges: coverage counts and identifying fragments

I have a dataset containing interval information for a series of manufacturing circuits.

df <- data.frame(structure(list(circuit = structure(c(2L, 1L, 2L, 1L, 2L, 3L, 
1L, 1L, 2L), .Label = c("a", "b", "c"), class = "factor"), start = structure(c(1393621200, 
1393627920, 1393628400, 1393631520, 1393650300, 1393646400, 1393656000, 
1393668000, 1393666200), class = c("POSIXct", "POSIXt"), tzone = ""), 
end = structure(c(1393626600, 1393631519, 1393639200, 1393632000, 
1393660500, 1393673400, 1393667999, 1393671600, 1393677000
), class = c("POSIXct", "POSIXt"), tzone = ""), id = structure(1:9, .Label = c("1001", 
"1002", "1003", "1004", "1005", "1006", "1007", "1008", "1009"
), class = "factor")), .Names = c("circuit", "start", "end", 
"id"), class = "data.frame", row.names = c(NA, -9L)))

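The problem is that, on my full dataset, the sqldf approach using an inequality join is unbearably slow.

How can I get something similar using IRanges alone?

I suspect this has something to do with RangedData, but I have not been able to work out how to get what I want. Here is what I tried: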

rd <- RangedData(ir, circuit = df$circuit, id = df$id)
coverage(rd) # works but seems to lose the circuit/id info

The coverage can be represented as ranges, dropping the first one (the range from 1970 to the first start point).
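A small base R sketch of where that leading range comes from, assuming the start times are converted with as.numeric() as in the code further down: POSIXct values are stored as seconds since 1970-01-01 UTC, and integer coverage positions start at 1, so everything before the first start forms one long zero-coverage run. The variable names are illustrative only.

```r
# First start time in the question's data, as seconds since the epoch.
first_start <- as.POSIXct(1393621200, origin = "1970-01-01", tz = "UTC")

# Human-readable form of the same instant.
first_start_label <- format(first_start, "%Y-%m-%d %H:%M:%S")

# Positions 1 .. (first_start - 1) have no circuit running, so they make up
# a single leading run that carries no information about the circuits.
leading_gap_width <- as.numeric(first_start) - 1

print(first_start_label)
print(leading_gap_width)
```

That leading run is what the `[-1]` subscript discards.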

The actual circuits are:

df[subjectHits(olaps), c("circuit", "id")]
These pieces can perhaps be knitted together:

df1 <- cbind(uid=seq_along(intervals),
             as.data.frame(intervals),
             circuits_running=tabulate(queryHits(olaps), queryLength(olaps)))
df2 <- cbind(uid=queryHits(olaps),
             df[subjectHits(olaps), c("circuit", "id")])
merge(df1, df2, by="uid", all=TRUE)
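To illustrate what the final merge does, here is a toy sketch in base R. The two small data frames are made-up stand-ins (not output from the real data): one row per interval on the left, one row per overlap on the right. With all=TRUE, an interval with no running circuit is kept and its circuit column is filled with NA.

```r
# Hypothetical stand-in for df1: one row per interval.
toy_intervals <- data.frame(uid = 1:3, circuits_running = c(1L, 0L, 2L))

# Hypothetical stand-in for df2: one row per (interval, circuit) overlap.
toy_hits <- data.frame(uid = c(1L, 3L, 3L), circuit = c("b", "a", "b"))

# Full outer join on uid: interval 2 has no hits but still appears,
# and interval 3 appears once per overlapping circuit.
toy_merged <- merge(toy_intervals, toy_hits, by = "uid", all = TRUE)
print(toy_merged)
```

Dropping all=TRUE would silently remove the gaps between circuits from the result, which may or may not be what is wanted.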

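Really interesting... I was asked a similar question in a Google interview; the best answer made clever use of a max-heap and ran in O(n ln c) time overall (where n is the number of records and c is the maximum number of circuits running at the same time).

If it runs slowly, add indexes to the query. See the sqldf home page for examples.

Thanks for the answer, Martin. Knitting everything together with the approach you outlined works very well.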
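# Coverage of all ranges; each run of constant coverage is one interval.
cov <- coverage(ir)
# Drop the first range (from 1970 to the first start point).
intervals <- ranges(cov)[-1]
# Match the last position of each interval against the original ranges.
olaps <- findOverlaps(narrow(intervals, width(intervals)), ir)
# Number of circuits running in each interval.
tabulate(queryHits(olaps), queryLength(olaps))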
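# Build the ranges from the POSIXct columns as seconds since the epoch.
ir <- IRanges(start = as.numeric(df$start), end = as.numeric(df$end))
# Attach the original columns as metadata so they travel with the ranges.
mcols(ir) <- DataFrame(df)
## ...
# Metadata (circuit, id, ...) of the ranges hit by each overlap.
mcols(ir[subjectHits(olaps)])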