Getting the intersection between two files in R

I have the positions of human exons (chromosome number, exon start, and exon end).

I also have a similar second file:

  > head(a2)
  SampleID Chromosome    Start      End
1  sampel1          1    64613  5707515
2  sampel1          1  5712940  5732322
3  sampel1          1  5732399 16383682
4  sampel1          1 16383742 16389288
5  sampel1          1 16390813 16830026
6  sampel1          1 16830201 17278112
> str(a2)
'data.frame':   7 obs. of  4 variables:
 $ SampleID  : chr  "sampel1" "sampe1" "sampel1" "sampel1" ...
 $ Chromosome: int  1 1 1 1 1 1 1
 $ Start     : int  64613 5712940 5732399 16383742 16390813 16830201 17284498
 $ End       : int  5707515 5732322 16383682 16389288 16830026 17278112 120374803
> dput(a2)
structure(list(SampleID = c("sampel1", "sampe1", "sampel1", "sampel1", 
"sampel1", "sampel1", "sampel1"), Chromosome = c(1L, 1L, 1L, 
1L, 1L, 1L, 1L), Start = c(64613L, 5712940L, 5732399L, 16383742L, 
16390813L, 16830201L, 17284498L), End = c(5707515L, 5732322L, 
16383682L, 16389288L, 16830026L, 17278112L, 120374803L)), class = "data.frame", row.names = c(NA, 
-7L))
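
I want to know how many exons fall within each interval of the second file.

For example, how many exons lie in the interval from 64613 to 5707515 in the second file?

My desired output looks like this: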

You are looking for the GenomicRanges package:

library(GenomicRanges)

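Could you post a reproducible version of these data? Should the first column of the second row in the second file be sample1? What output do you expect for these two example tables?

Do you know how GRanges compares with foverlaps from "data.table"? That is the other approach that came to mind.

@A5C1D2H2I1M1N2O1R2T1 I have not benchmarked foverlaps, but I have used GenomicRanges::findOverlaps on genome-scale data to evaluate Alu elements. It works for anything that fits in memory, which in my case is about 100 GB.

Sorry, why am I getting this error? > a2.GRanges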
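For the data.table alternative raised in the comments, a minimal sketch with foverlaps might look like the following. The exons table is hypothetical (made-up coordinates, since the first file is not shown here), and only the first two intervals of a2 are used; this is an illustration under those assumptions, not a benchmarked solution.

```r
library(data.table)

# Hypothetical exon table standing in for the first file (made-up coordinates)
exons <- data.table(
  Chromosome = c(1L, 1L, 1L, 1L),
  Start      = c(70000L, 100000L, 5720000L, 5731000L),
  End        = c(71000L, 101000L, 5721000L, 5740000L)
)

# First two intervals of a2 from the question
intervals <- data.table(
  Chromosome = c(1L, 1L),
  Start      = c(64613L, 5712940L),
  End        = c(5707515L, 5732322L)
)

# foverlaps() requires the second table to be keyed; the last two key
# columns are treated as the interval start and end
setkey(intervals, Chromosome, Start, End)

# type = "within": keep exons lying entirely inside an interval;
# nomatch = NULL drops exons that match no interval
hits <- foverlaps(exons, intervals,
                  by.x = c("Chromosome", "Start", "End"),
                  type = "within", nomatch = NULL)

# Number of exons per interval (Start/End here are the interval's columns)
counts <- hits[, .N, by = .(Chromosome, Start, End)]
print(counts)
```

Intervals that contain no exon do not appear in counts; joining counts back onto intervals would be one way to report them with a zero.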
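library(GenomicRanges)

# Extra named arguments are stored as metadata columns. seqinfo expects a
# Seqinfo object or unique sequence names, so per-row labels passed there
# cause an error.
a1.GRanges <- GRanges(seqnames = as.character(a1$Chromosome),
                      ranges = IRanges(a1$Start, a1$End),
                      exons = a1$exons)

a2.GRanges <- GRanges(seqnames = as.character(a2$Chromosome),
                      ranges = IRanges(a2$Start, a2$End),
                      SampleID = a2$SampleID)

# Pairs of (a2 interval, a1 exon) that overlap
findOverlaps(a2.GRanges, a1.GRanges)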
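
Since the question asks for a count per interval rather than a list of overlapping pairs, one possible follow-up is countOverlaps, or findOverlaps with type = "within" when only exons lying entirely inside an interval should count. The sketch below assumes a hypothetical exon table (made-up coordinates, since the first file is not shown here) together with the first two intervals of a2.

```r
library(GenomicRanges)

# Hypothetical exon table standing in for a1 (made-up coordinates)
exons <- data.frame(
  Chromosome = c(1L, 1L, 1L, 1L),
  Start      = c(70000L, 100000L, 5720000L, 5731000L),
  End        = c(71000L, 101000L, 5721000L, 5740000L)
)

# First two intervals of a2 from the question
intervals <- data.frame(
  Chromosome = c(1L, 1L),
  Start      = c(64613L, 5712940L),
  End        = c(5707515L, 5732322L)
)

exons.gr <- GRanges(seqnames = as.character(exons$Chromosome),
                    ranges   = IRanges(exons$Start, exons$End))
intervals.gr <- GRanges(seqnames = as.character(intervals$Chromosome),
                        ranges   = IRanges(intervals$Start, intervals$End))

# Exons that touch each interval at all (any overlap)
intervals$n_overlapping <- countOverlaps(intervals.gr, exons.gr)

# Exons lying entirely inside each interval: query = exons, subject = intervals
hits <- findOverlaps(exons.gr, intervals.gr, type = "within")
intervals$n_within <- tabulate(subjectHits(hits), nbins = length(intervals.gr))

print(intervals)
```

With these toy values, the fourth exon extends past the end of the second interval, so it is counted under n_overlapping but not under n_within. Which definition matches the intended output is for the asker to decide.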