Conditionally removing rows in R based on multiple columns

I have a data frame with 4 columns and more than 500 rows. I want to conditionally remove rows from it based on several columns.

df.original
   chr   start    end   type     
1 chrI  232613  232625  ins  
2 chrI  834151  834151  snp  
3 chrI  834161  834161  snp  
4 chrI  834171  834177  del 
5 chrI 1123752 1123805  del 
6 chrI 1377649 1377649  snp 
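What I want to do is look at each row and check whether its type (snp, ins, del) and chr match those of another row. If they do, I want to compare the start and end positions. If both the start and end positions are within ±50 of those of any other matching row, I want to remove that row and the other rows within ±50.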

df.new
   chr   start    end   type     
1 chrI  232613  232625  ins  
2 chrI  834171  834177  del 
3 chrI 1123752 1123805  del 
4 chrI 1377649 1377649  snp 
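In the new data frame, rows 2 and 3 of the original have both been removed because they are on the same chr, have the same type, and their start and end positions are within ±50 of each other.

Thanks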

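Maybe this will be useful. You can group by type, then compute the number of elements and the differences between the start/end variables. After that, you can create a flag variable that identifies the values to remove, and filter on it. Here is the code using dplyr: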
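A minimal sketch of those steps, not necessarily the code originally posted with this answer. It assumes the question's six rows as `df.original`, groups by both `chr` and `type`, and relies on a hypothetical helper `has_neighbour()` (not part of dplyr) to flag rows whose start and end are both within 50 of another row in the same group:

```r
library(dplyr)

# The six example rows from the question
df.original <- tibble(
  chr   = rep("chrI", 6),
  start = c(232613L, 834151L, 834161L, 834171L, 1123752L, 1377649L),
  end   = c(232625L, 834151L, 834161L, 834177L, 1123805L, 1377649L),
  type  = c("ins", "snp", "snp", "del", "del", "snp")
)

# Hypothetical helper: TRUE for row i when at least one *other* row in the
# same group has both its start and its end within `tol` of row i
has_neighbour <- function(start, end, tol = 50) {
  vapply(seq_along(start), function(i) {
    any(abs(start[-i] - start[i]) <= tol & abs(end[-i] - end[i]) <= tol)
  }, logical(1))
}

df.new <- df.original %>%
  group_by(chr, type) %>%                      # compare only same chr and type
  mutate(flag = has_neighbour(start, end)) %>% # flag rows that have a close neighbour
  ungroup() %>%
  filter(!flag) %>%                            # drop every flagged row
  select(-flag)

df.new
```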

Output:

# A tibble: 4 x 4
  chr     start     end type 
  <chr>   <int>   <int> <chr>
1 chrI   232613  232625 ins  
2 chrI   834171  834177 del  
3 chrI  1123752 1123805 del  
4 chrI  1377649 1377649 snp  

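Since you are working with genes represented as integer ranges, Bioconductor's IRanges and GenomicRanges (which provides GRanges) packages may be the ideal choice.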

library(IRanges)
library(tidyverse) 

#Turn your data.frame into S4 object IRanges
IR <- IRanges(
  start = c(232613, 834151, 834161, 834171, 1123752, 1377649),
  end = c(232625, 834151, 834161, 834177, 1123805, 1377649),
  type = c("ins", "snp", "snp", "del", "del", "snp")
)
IR
>IRanges object with 6 ranges and 1 metadata column:
          start       end     width |        type
      <integer> <integer> <integer> | <character>
  [1]    232613    232625        13 |         ins
  [2]    834151    834151         1 |         snp
  [3]    834161    834161         1 |         snp
  [4]    834171    834177         7 |         del
  [5]   1123752   1123805        54 |         del
  [6]   1377649   1377649         1 |         snp
types <- mcols(IR)$type %>% unique()

#A loop (less than ideal) to make each 'type' an element of a list
list.IR <- list()
for(i in seq_along(types)){
  #use [[ ]] so the whole IRanges object is stored as a single list element
  list.IR[[i]] <- IR[mcols(IR)$type == types[i]]
}
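#create a function that removes IRanges with more than one overlap (ie, other than itself)
ovlp_rm <- function(IR){
  #with both = TRUE each range becomes a 50-wide window straddling its start (25 on either side)
  IR.flank <- flank(IR, width = 25, both = TRUE)
  #count overlaps within the set; every window overlaps itself, so 1 means no neighbour
  n_ovlp <- countOverlaps(IR.flank)
  indx_no.ovlp <- n_ovlp == 1
  return(IR[indx_no.ovlp])
}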
#apply the function on your list of IRanges, organized by type
lapply(list.IR, FUN = ovlp_rm) 

[[1]]
IRanges object with 1 range and 1 metadata column:
          start       end     width |        type
      <integer> <integer> <integer> | <character>
  [1]    232613    232625        13 |         ins

[[2]]
IRanges object with 1 range and 1 metadata column:
          start       end     width |        type
      <integer> <integer> <integer> | <character>
  [1]   1377649   1377649         1 |         snp

[[3]]
IRanges object with 2 ranges and 1 metadata column:
          start       end     width |        type
      <integer> <integer> <integer> | <character>
  [1]    834171    834177         7 |         del
  [2]   1123752   1123805        54 |         del
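
The per-type list above does not take `chr` into account. As a hedged alternative, a GRanges object stores the chromosome with each range, so the comparison can be done in one pass. This sketch assumes the GenomicRanges package is installed and uses `findOverlaps()` with `maxgap = 50`, which treats ranges separated by at most 50 positions as hits; that approximates the ±50 rule from the question rather than reproducing it exactly. The names `gr`, `hits`, and `to_drop` are illustrative only.

```r
library(GenomicRanges)

# The six example rows from the question, now with their chromosome attached
gr <- GRanges(
  seqnames = rep("chrI", 6),
  ranges = IRanges(
    start = c(232613, 834151, 834161, 834171, 1123752, 1377649),
    end   = c(232625, 834151, 834161, 834177, 1123805, 1377649)
  ),
  type = c("ins", "snp", "snp", "del", "del", "snp")
)

# All pairs of ranges on the same chromosome that overlap or lie within
# 50 positions of each other (this includes every range paired with itself)
hits <- findOverlaps(gr, gr, maxgap = 50L)
q <- queryHits(hits)
s <- subjectHits(hits)

# Keep only pairs made of two different rows that share the same type
same_type_pair <- q != s & gr$type[q] == gr$type[s]
to_drop <- unique(q[same_type_pair])

# Negative indexing with an empty vector would drop everything, so guard it
gr.new <- if (length(to_drop) > 0) gr[-to_drop] else gr
gr.new
```

Because the result is a single GRanges object, it can be turned back into a data frame with `as.data.frame(gr.new)` if the remaining rows are needed in tabular form.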