R GenomicRanges: adding up coverage
I am working with RNA-seq data and trying to plot the average coverage profile by genotype, similar to what was done here: RNA-seq coverage per genotype (source: Pickrell et al., Nature, 2010).
To make this plot, I have bigWig files from 100 individuals containing the coverage information from the RNA-seq data (over a specific region), which I read into R as GenomicRanges objects.
This gives me GRanges objects such as the ones in the following toy example:
gr1 = GRanges(seqnames=1, ranges=IRanges(start=c(1,5,10,15,30,55), end=c(4,9,14,29,39,60)))
gr1$cov = c(3,1,8,6,2,10)
gr2 = GRanges(seqnames=1, ranges=IRanges(start=c(3,20,24), end=c(7,23,26)))
gr2$cov = c(3,5,3)
start = unique(sort(c(ranges(gr1)@start, ranges(gr2)@start)))
gr1
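GRanges object with 6 ranges and 1 metadata column:
seqnames ranges strand | cov
<Rle> <IRanges> <Rle> | <numeric>
1 [ 1, 4] * | 3
1 [ 5, 9] * | 1
1 [10, 14] * | 8
1 [15, 29] * | 6
1 [30, 39] * | 2
1 [55, 60] * | 10
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths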
gr2
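GRanges object with 3 ranges and 1 metadata column:
seqnames ranges strand | cov
<Rle> <IRanges> <Rle> | <numeric>
1 [ 3, 7] * | 3
1 [20, 23] * | 5
1 [24, 26] * | 3
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths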
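The problem is that I have one of these for each individual (gr1 and gr2 are two different individuals), and I would like to combine them into a single GenomicRanges object that gives me the total coverage of individuals 1 and 2 at each position.
It would look like this: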
gr3
GRanges object with 11 ranges and 1 metadata column:
seqnames ranges strand | cov
<Rle> <IRanges> <Rle> | <numeric>
1 [ 1, 2] * | 3
1 [ 3, 4] * | 6 (=3+3)
1 [ 5, 7] * | 4 (=1+3)
1 [ 8, 9] * | 1
1 [10, 14] * | 8
1 [15, 19] * | 6
1 [20, 23] * | 11 (=6+5)
1 [24, 26] * | 9 (=6+3)
1 [27, 29] * | 6
1 [30, 39] * | 2
1 [55, 60] * | 10
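Does anyone know a simple way to do this? Or am I doomed?
Thanks for your answers.
PS: My data is not stranded, but a solution that also handles stranded data would be even better.
PPS: Ideally, I would also like to be able to compute a product, or apply any function of two arguments x and y, rather than simply adding the coverage.

It has been almost a year, but here is my answer for future reference.
Whenever I cannot find a function that performs a task like this directly, I simply expand the GRanges objects to single-bp resolution. This lets me perform any required operation on the metadata columns, treating them as simple data.frame columns, because the IRanges now match between the two GRanges objects.
For the specific case in this question, the following works: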
### Sort seqlevels
# (not necessary here, but in real world examples,
# with multiple sequences, you will want to do this)
gr1 <- sort(GenomeInfoDb::sortSeqlevels(gr1))
gr2 <- sort(GenomeInfoDb::sortSeqlevels(gr2))
### Add seqlengths
# (this corresponds to the actual sequence lengths;
# here we use the highest position between the two objects: 60)
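seqlengths(gr1) <- 60
# (setting the same length on gr2 keeps both coverage vectors the same length)
seqlengths(gr2) <- 60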
### Make 1-bp tiles covering the genome
# (using either one of gr1 and gr2 as a reference)
bins <- GenomicRanges::tileGenome(GenomeInfoDb::seqlengths(gr1),
tilewidth=1,
cut.last.tile.in.chrom=TRUE)
### Get coverage signal as Rle object
gr1_cov <- coverage(gr1, weight="cov")
gr2_cov <- coverage(gr2, weight="cov")
### Get average coverage in each bin
# (since the bins are 1-bp wide, this just keeps the original coverage value)
gr1_bins <- GenomicRanges::binnedAverage(bins, gr1_cov, "binned_cov")
gr2_bins <- GenomicRanges::binnedAverage(bins, gr2_cov, "binned_cov")
### Make final object:
# We can now sum the values in the metadata columns
# Addressing the PPS, you could do any other operation or apply a function
gr3 <- gr1_bins
gr3$binned_cov <- gr1_bins$binned_cov + gr2_bins$binned_cov
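Regarding the PPS, the per-base idea can be sketched in base R alone, without Bioconductor. This is only a minimal illustration under stated assumptions: expand_cov and combine_cov are hypothetical helper names (not part of GenomicRanges), the intervals within one individual are assumed not to overlap, and uncovered positions are treated as 0.

```r
# Hypothetical helper: expand (start, end, value) intervals to a per-base vector.
# Assumes the intervals of one individual do not overlap each other.
expand_cov <- function(start, end, value, len) {
  out <- numeric(len)                 # positions without coverage stay 0
  for (i in seq_along(start)) {
    out[start[i]:end[i]] <- value[i]
  }
  out
}

len <- 60                             # highest position in the toy example
x <- expand_cov(c(1, 5, 10, 15, 30, 55), c(4, 9, 14, 29, 39, 60),
                c(3, 1, 8, 6, 2, 10), len)
y <- expand_cov(c(3, 20, 24), c(7, 23, 26), c(3, 5, 3), len)

# Hypothetical helper: apply any vectorized two-argument function per position.
combine_cov <- function(x, y, f = `+`) f(x, y)

total   <- combine_cov(x, y)          # sum, as in the question
product <- combine_cov(x, y, `*`)     # multiplication, as in the PPS
```

Note that with multiplication every position covered in only one individual becomes 0, so whether 0 is the right value for "no coverage" depends on the function you apply.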
To compress the 1-bp-resolution gr3 back down and obtain the exact gr3 from the question, we can do the following:
### Compress back to variable-width IRanges (by cov)
gr3_Rle <- coverage(gr3, weight='binned_cov')
gr3 <- as(gr3_Rle, "GRanges")
### Drop 0-score rows
gr3 <- gr3[gr3$score > 0]
### Rename metadata column
names(mcols(gr3)) <- 'cov'
> gr3
GRanges object with 11 ranges and 1 metadata column:
seqnames ranges strand | cov
<Rle> <IRanges> <Rle> | <numeric>
[1] 1 [ 1, 2] * | 3
[2] 1 [ 3, 4] * | 6
[3] 1 [ 5, 7] * | 4
[4] 1 [ 8, 9] * | 1
[5] 1 [10, 14] * | 8
[6] 1 [15, 19] * | 6
[7] 1 [20, 23] * | 11
[8] 1 [24, 26] * | 9
[9] 1 [27, 29] * | 6
[10] 1 [30, 39] * | 2
[11] 1 [55, 60] * | 10
-------
seqinfo: 1 sequence from an unspecified genome
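The compression step relies on run-length encoding: consecutive positions with the same value collapse into one range. As a rough base-R sketch of that idea (using rle() on a plain numeric vector rather than the Bioconductor Rle class, and a per-base vector typed in by hand from the toy example):

```r
# Per-base summed coverage for positions 1..60 of the toy example.
total <- c(3, 3, 6, 6, 4, 4, 4, 1, 1,
           rep(8, 5), rep(6, 5), rep(11, 4), rep(9, 3), rep(6, 3),
           rep(2, 10), rep(0, 15), rep(10, 6))

runs  <- rle(total)                   # runs of identical consecutive values
end   <- cumsum(runs$lengths)         # last position of each run
start <- end - runs$lengths + 1       # first position of each run

ranges_df <- data.frame(start = start, end = end, cov = runs$values)

# Drop uncovered stretches, mirroring the `score > 0` filter above.
ranges_df <- ranges_df[ranges_df$cov > 0, ]
```

One caveat with the `> 0` filter, here and in the GRanges version: if the function you apply can legitimately return 0 for a covered position, those ranges are dropped as well.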
For reference, the uncompressed 1-bp-resolution gr3 (before the compression step) prints as:
> gr3
GRanges object with 60 ranges and 1 metadata column:
seqnames ranges strand | binned_cov
<Rle> <IRanges> <Rle> | <numeric>
[1] 1 [1, 1] * | 3
[2] 1 [2, 2] * | 3
[3] 1 [3, 3] * | 6
[4] 1 [4, 4] * | 6
[5] 1 [5, 5] * | 4
... ... ... ... . ...
[56] 1 [56, 56] * | 10
[57] 1 [57, 57] * | 10
[58] 1 [58, 58] * | 10
[59] 1 [59, 59] * | 10
[60] 1 [60, 60] * | 10
-------
seqinfo: 1 sequence from an unspecified genome