Conditional row removal based on multiple columns in R
I have a data frame with 4 columns and 500+ rows. I want to conditionally remove rows from the data frame based on multiple columns.
df.original
chr start end type
1 chrI 232613 232625 ins
2 chrI 834151 834151 snp
3 chrI 834161 834161 snp
4 chrI 834171 834177 del
5 chrI 1123752 1123805 del
6 chrI 1377649 1377649 snp
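What I want to do is look at each row and check whether its type (snp, ins, del) and its chr match those of another row. If they do, I then want to compare the start and end positions. If a row's start and end positions are both within ±50 of those of any other such row, I want to remove it along with the other rows that fall within ±50.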
df.new
chr start end type
1 chrI 232613 232625 ins
2 chrI 834171 834177 del
3 chrI 1123752 1123805 del
4 chrI 1377649 1377649 snp
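In the new data frame, original rows 2 and 3 have both been removed, because they are on the same chr, have the same type, and their start and end positions are within ±50 of each other.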
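The rule described above can be sketched in base R as a pairwise check. This is only a minimal sketch under stated assumptions: `has_close_neighbour` and its `window` argument are hypothetical names introduced for illustration, and "within ±50" is read as an absolute difference of at most 50 in both start and end.

```r
# Example data copied from the question
df.original <- data.frame(
  chr   = rep("chrI", 6),
  start = c(232613, 834151, 834161, 834171, 1123752, 1377649),
  end   = c(232625, 834151, 834161, 834177, 1123805, 1377649),
  type  = c("ins", "snp", "snp", "del", "del", "snp"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns TRUE for each row that has at least one OTHER
# row with the same chr and type whose start and end are both within `window`.
has_close_neighbour <- function(df, window = 50) {
  vapply(seq_len(nrow(df)), function(i) {
    same  <- df$chr == df$chr[i] & df$type == df$type[i]
    close <- abs(df$start - df$start[i]) <= window &
             abs(df$end   - df$end[i])   <= window
    hit <- same & close
    hit[i] <- FALSE  # a row does not count as its own neighbour
    any(hit)
  }, logical(1))
}

# Drop every flagged row and renumber the remaining rows
df.new <- df.original[!has_close_neighbour(df.original), ]
rownames(df.new) <- NULL
df.new
```

The check compares every row with every other row, so it is quadratic in the number of rows. That is fine for a few hundred rows, but an overlap-based approach scales better on large inputs.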
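Thanks.
Maybe this will be useful. You can group by type, then count the number of elements in each group and compute the differences between the start/end variables. After that, you can create a flag variable that identifies the values to remove, and filter on it. The code uses dplyr. Output: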
# A tibble: 4 x 4
chr start end type
<chr> <int> <int> <chr>
1 chrI 232613 232625 ins
2 chrI 834171 834177 del
3 chrI 1123752 1123805 del
4 chrI 1377649 1377649 snp
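One way the group-and-flag idea might look in dplyr is sketched below. It is not the answerer's original code. It assumes grouping by both `chr` and `type` (the answer mentions only type), and the column names `n_in_group` and `flag` are made up for illustration.

```r
library(dplyr)

# Example data copied from the question
df.original <- data.frame(
  chr   = rep("chrI", 6),
  start = c(232613, 834151, 834161, 834171, 1123752, 1377649),
  end   = c(232625, 834151, 834161, 834177, 1123805, 1377649),
  type  = c("ins", "snp", "snp", "del", "del", "snp"),
  stringsAsFactors = FALSE
)

df.new <- df.original %>%
  group_by(chr, type) %>%
  mutate(
    n_in_group = n(),  # number of rows sharing this chr and type
    # flag a row if any other row in its group has start AND end within 50
    flag = vapply(seq_along(start), function(i) {
      any(abs(start[-i] - start[i]) <= 50 & abs(end[-i] - end[i]) <= 50)
    }, logical(1))
  ) %>%
  ungroup() %>%
  filter(!flag) %>%
  select(chr, start, end, type)

df.new
```

Groups with a single row are never flagged, because there is no other row to compare against.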
Given that you are working with genes represented as integer ranges, Bioconductor's GenomicRanges (GRanges) and IRanges packages may be the ideal choice.
library(IRanges)
library(tidyverse)  # used here only for the %>% pipe
# Turn your data.frame into an S4 IRanges object; `type` is stored as a metadata column
IR <- IRanges(
  start = c(232613, 834151, 834161, 834171, 1123752, 1377649),
  end = c(232625, 834151, 834161, 834177, 1123805, 1377649),
  type = c("ins", "snp", "snp", "del", "del", "snp")
)
IR
>IRanges object with 6 ranges and 1 metadata column:
start end width | type
<integer> <integer> <integer> | <character>
[1] 232613 232625 13 | ins
[2] 834151 834151 1 | snp
[3] 834161 834161 1 | snp
[4] 834171 834177 7 | del
[5] 1123752 1123805 54 | del
[6] 1377649 1377649 1 | snp
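types <- mcols(IR)$type %>% unique()
# A loop (less than ideal) that makes each 'type' an element of a list
list.IR <- list()
for (i in seq_along(types)) {
  # use [[i]] so the whole IRanges object is stored as one list element
  list.IR[[i]] <- IR[mcols(IR)$type == types[i]]
}
# Create a function that removes IRanges with more than one overlap (i.e., with anything other than itself)
ovlp_rm <- function(IR) {
  # with both = TRUE, flank() returns a window straddling each range's start,
  # 25 positions on either side, so only the start positions are compared
  IR.flank <- flank(IR, width = 25, both = TRUE)
  # count how many windows each window overlaps (every window overlaps itself)
  n_ovlp <- countOverlaps(IR.flank, IR.flank)
  indx_no.ovlp <- n_ovlp == 1
  return(IR[indx_no.ovlp])
}
# Apply the function to your list of IRanges, organized by type
lapply(list.IR, FUN = ovlp_rm)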
> lapply(list.IR, ovlp_rm)
[[1]]
IRanges object with 1 range and 1 metadata column:
start end width | type
<integer> <integer> <integer> | <character>
[1] 232613 232625 13 | ins
[[2]]
IRanges object with 1 range and 1 metadata column:
start end width | type
<integer> <integer> <integer> | <character>
[1] 1377649 1377649 1 | snp
[[3]]
IRanges object with 2 ranges and 1 metadata column:
start end width | type
<integer> <integer> <integer> | <character>
[1] 834171 834177 7 | del
[2] 1123752 1123805 54 | del
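Because an IRanges object carries no chromosome information, the approach above never compares chr. A possible variant with GRanges from the GenomicRanges package is sketched below. This is an assumption-laden sketch rather than part of the original answer: it treats two ranges as "close" when the gap between them is at most 50 (via the `maxgap` argument of `findOverlaps`), which is not exactly the same as requiring start and end to each differ by at most 50.

```r
library(GenomicRanges)

# Build a GRanges object so that the chromosome is part of each range
gr <- GRanges(
  seqnames = rep("chrI", 6),
  ranges = IRanges(
    start = c(232613, 834151, 834161, 834171, 1123752, 1377649),
    end   = c(232625, 834151, 834161, 834177, 1123805, 1377649)
  ),
  type = c("ins", "snp", "snp", "del", "del", "snp")
)

# All pairs on the same chromosome separated by a gap of at most 50 positions
hits <- findOverlaps(gr, gr, maxgap = 50L)
q <- queryHits(hits)
s <- subjectHits(hits)

# Keep only pairs of two different ranges that share the same type
is_pair <- q != s & gr$type[q] == gr$type[s]
flagged <- unique(q[is_pair])

# Drop every range that takes part in at least one such pair
gr.new <- gr[!seq_along(gr) %in% flagged]
as.data.frame(gr.new)[, c("seqnames", "start", "end", "type")]
```

Filtering the hits by type afterwards avoids the per-type loop, and `seqnames` takes care of the chr comparison.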