How to filter rows by column value ranges in R?
I have two genetic datasets. One defines a genomic range on each row. The other has one row per gene, each with a start and end position, and I want to make sure those gene ranges do not overlap any of the ranges in the first dataset. For example, my data look like this:
#df1:
Chromosome Min Max
1 10 500
1 450 550
2 20 100
2 900 1500
2 200 210
3 5 15
4 10 20
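I want to pull out/select the rows of df2 whose Gene.Start to Gene.End range does not fall within any of the ranges given by the Min and Max columns of df1. Importantly, the chromosome number must also match.
The expected output for the example looks like this: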
Gene Gene.Start Gene.End Chromosome
Gene2 950 990 1
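Gene2 is the only gene/row whose start-to-end range does not fall within any df1 range on the matching chromosome (here, the chromosome 1 ranges in df1).
To code this I tried data.table, but I am not sure how to express the ranges the way I want.
I have been trying to get this to work, but I am not sure what I am doing: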
df2[df1, match := i.Gene,
on = .(Chromosome, (df2$Gene.Start > & < df2$Gene.End) > Min, (df2$Gene.Start > & < df2$Gene.End) < Max)]
Error: unexpected '&'
How can I filter one data frame by its ranges, based on the ranges in another data frame?
Example input data:
df1 <- structure(list(Chromosome = c(1L, 1L, 2L, 2L, 2L, 3L, 4L), Min = c(10L,
450L, 20L, 900L, 200L, 5L, 10L), Max = c(500L, 550L, 100L, 1500L,
210L, 15L, 20L)), row.names = c(NA, -7L), class = c("data.table",
"data.frame"))
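Here is a data.table approach (the code is the library(data.table) block further down).
Here is my attempt at a dplyr approach. Please let me know.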
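library(dplyr)
# Genes that overlap at least one df1 range on the same chromosome.
# Note: dplyr >= 1.1 may warn about a many-to-many join here; that is expected.
overlapping <- df2 %>%
  inner_join(df1, by = "Chromosome") %>%
  filter(Gene.Start <= Max, Gene.End >= Min) %>%
  distinct(Gene)
# Keep only the genes that overlap no df1 range
df2 %>%
  anti_join(overlapping, by = "Gene") %>%
  select(Gene, Gene.Start, Gene.End, Chromosome)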
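The data.table solution works best, as it was the fastest on my much larger real data, but I eventually also found another solution using GenomicRanges, so I thought I would share it here for future reference: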
library(GenomicRanges)
gr1 <- makeGRangesFromDataFrame(
data.frame(
chr=df1$Chromosome,
start=df1$Min,
end=df1$Max),
keep.extra.columns=TRUE)
gr2 <- makeGRangesFromDataFrame(
data.frame(
chr=df2$Chromosome,
start=df2$Gene.Start,
end=df2$Gene.End,
Gene = df2$Gene),
keep.extra.columns=TRUE)
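# Keep the gene ranges that overlap no df1 range. overlapsAny() returns a logical
# vector, so this also behaves correctly when there are no overlaps at all
# (negative indexing with an empty queryHits() result would drop every row).
no_overlaps <- gr2[!overlapsAny(gr2, gr1, type = "any")]
no_overlap_genes <- unique(no_overlaps$Gene)
# df2 is a data.table here, so this subsets its rows by gene name
gene_matches <- df2[Gene %in% no_overlap_genes]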
Not R, but bedtools is designed for exactly this. In R, you could try GenomicRanges, or data.table, in particular its foverlaps function.
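As a rough illustration of the foverlaps suggestion, here is a minimal sketch. df1 is taken from the question, but df2 is hypothetical example data (the asker's df2 is not shown in this thread), chosen so that Gene2 is the only gene without an overlap. Unmatched genes come back with NA in the df1 columns, which is what the final filter relies on.

```r
library(data.table)

# df1 as given in the question
df1 <- data.table(
  Chromosome = c(1L, 1L, 2L, 2L, 2L, 3L, 4L),
  Min = c(10L, 450L, 20L, 900L, 200L, 5L, 10L),
  Max = c(500L, 550L, 100L, 1500L, 210L, 15L, 20L)
)

# Hypothetical df2 for illustration only (NOT the asker's data)
df2 <- data.table(
  Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5"),
  Gene.Start = c(100L, 950L, 50L, 1000L, 90L),
  Gene.End = c(200L, 990L, 80L, 1200L, 150L),
  Chromosome = c(1L, 1L, 2L, 2L, 2L)
)

# foverlaps() needs the second table keyed, with the interval columns last
setkey(df1, Chromosome, Min, Max)

# Overlap join: one row per (gene, overlapping df1 range);
# genes with no overlap appear once with NA in Min and Max
hits <- foverlaps(
  df2, df1,
  by.x = c("Chromosome", "Gene.Start", "Gene.End"),
  by.y = c("Chromosome", "Min", "Max"),
  type = "any"
)

# Keep the genes that matched no df1 range
no_overlap <- hits[is.na(Min), .(Gene, Gene.Start, Gene.End, Chromosome)]
print(no_overlap)
```

With type = "any", a gene counts as overlapping if it shares at least one position with a df1 range on the same chromosome; type = "within" would instead require the gene to lie entirely inside a range.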
library(data.table)
# keep Gene that are not joined in the non-equi join on df1 below
df2[!Gene %in% df2[df1, on = .(Chromosome, Gene.Start >= Min, Gene.End <= Max)]$Gene, ]
# Gene Gene.Start Gene.End Chromosome
# 1: Gene2 950 990 1
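One thing worth checking on your own data: the join condition above matches genes that lie entirely inside a df1 range. If partial overlaps should also be excluded, the condition can be flipped to the usual interval-overlap test. The sketch below compares the two, using df1 from the question and a hypothetical df2 (not the asker's data) in which Gene5 only partially overlaps a range.

```r
library(data.table)

# df1 as given in the question
df1 <- data.table(
  Chromosome = c(1L, 1L, 2L, 2L, 2L, 3L, 4L),
  Min = c(10L, 450L, 20L, 900L, 200L, 5L, 10L),
  Max = c(500L, 550L, 100L, 1500L, 210L, 15L, 20L)
)

# Hypothetical df2 for illustration only (NOT the asker's data);
# Gene5 (90-150 on chromosome 2) partially overlaps the 20-100 range
df2 <- data.table(
  Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5"),
  Gene.Start = c(100L, 950L, 50L, 1000L, 90L),
  Gene.End = c(200L, 990L, 80L, 1200L, 150L),
  Chromosome = c(1L, 1L, 2L, 2L, 2L)
)

# Containment: gene starts at or after Min and ends at or before Max
contained <- unique(
  df2[df1, on = .(Chromosome, Gene.Start >= Min, Gene.End <= Max)]$Gene
)

# Any overlap: gene starts at or before Max and ends at or after Min
overlapping <- unique(
  df2[df1, on = .(Chromosome, Gene.Start <= Max, Gene.End >= Min)]$Gene
)

keep_not_contained <- df2[!Gene %in% contained]
keep_no_overlap <- df2[!Gene %in% overlapping]

print(keep_not_contained$Gene)
print(keep_no_overlap$Gene)
```

On this example data the containment test keeps Gene2 and Gene5, while the any-overlap test keeps only Gene2.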
The dplyr answer reported this output:
Gene Gene.Start Gene.End Chromosome
1 Gene2 950 990 1