Removing rows from a data.frame


I have an example data.frame:

df <- data.frame(id=c("a","a,b,c","d,e","d","h","e","i","b","c"), start=c(100,100,400,400,800,500,900,200,300), end=c(150,350,550,450,850,550,950,250,350), level = c(1,5,2,3,6,4,2,1,1))

> df
     id start end level
1     a   100 150     1
2 a,b,c   100 350     5
3   d,e   400 550     2
4     d   400 450     3
5     h   800 850     6
6     e   500 550     4
7     i   900 950     2
8     b   200 250     1
9     c   300 350     1
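
Method 1 (ID values)

If we can assume that every "merged" group has an ID that is a comma-separated list of the individual groups' IDs, then we can solve this by looking only at the IDs and ignoring the start/end information. Here is one such approach.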

First, find all the "merged" groups by looking for IDs that contain a comma:

groups<-Filter(function(x) length(x)>1, 
    setNames(strsplit(as.character(df$id),","),df$id))
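
The thread shows only this first step of Method 1. As a hedged sketch of how it might continue, the code below borrows the comparison rule used by the later methods: a merged row is kept only if its level is strictly higher than that of every individual row it covers. The names `to_drop` and `result` are illustrative and not from the original, and the sketch assumes every member of a merged ID also appears as its own row.

```r
df <- data.frame(
  id    = c("a","a,b,c","d,e","d","h","e","i","b","c"),
  start = c(100,100,400,400,800,500,900,200,300),
  end   = c(150,350,550,450,850,550,950,250,350),
  level = c(1,5,2,3,6,4,2,1,1),
  stringsAsFactors = FALSE
)

# merged groups: ids that split into more than one member
groups <- Filter(function(x) length(x) > 1,
                 setNames(strsplit(df$id, ","), df$id))

# for each merged group, decide which ids to drop (assumed rule, see above)
to_drop <- unlist(lapply(names(groups), function(g) {
  members       <- groups[[g]]
  merged_level  <- df$level[df$id == g]
  member_levels <- df$level[df$id %in% members]
  # merged row wins only if it beats every member; otherwise the members win
  if (merged_level > max(member_levels)) members else g
}))

result <- df[!df$id %in% to_drop, ]
result
```

On the sample data this keeps the rows with ids "a,b,c", "d", "h", "e" and "i", the same rows the other methods select.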
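
Method 2 (start/end graph)

I also wanted to try an approach that ignores the (very useful) merged ID names and looks only at the start/end positions. I may have gone in the wrong direction, but this struck me as a network/graph type of problem, so I used igraph.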

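I created a graph in which each vertex represents a start/end position, so each edge represents a range. I used all the ranges from the sample data set and filled in any missing ranges so that the graph is connected. I combined these into a single edge list. For each edge I kept the "level" and "id" values from the original data set. Here is the code that does this: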

library(igraph)

poslist<-sort(unique(c(df$start, df$end)))
seq.el<-embed(rev(poslist),2)
class(seq.el)<-"character"
colnames(seq.el)<-c("start","end")

el<-rbind(df[,c("start","end","level", "id")],data.frame(seq.el, level=0, id=""))
el<-el[!duplicated(el[,1:2]),]

gg<-graph.data.frame(el)
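
The least obvious line above is probably `embed(rev(poslist), 2)`, which pairs each position with the next one so that gaps between ranges can be filled with level-0 edges. A small base-R sketch on three made-up positions (not the full data set) shows the shape of the result:

```r
poslist <- c(100, 150, 200)   # toy positions, chosen only for illustration

# embed() on the reversed vector yields one row per consecutive pair,
# ordered from the last pair to the first
seq.el <- embed(rev(poslist), 2)
colnames(seq.el) <- c("start", "end")
seq.el
```

Each row is a (start, end) pair of neighbouring positions: here 150 to 200, then 100 to 150.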
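
Method 3 (overlap matrix)

Here is another way to look at the start/stop positions. I created a matrix whose columns correspond to the ranges in the rows of the data.frame and whose rows correspond to positions. Each value in the matrix is TRUE if the range overlaps the position. Here I use a helper function, between.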

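#helper: its definition is not shown in this thread; assumed here to be an inclusive range test
between <- function(x, lo, hi) x >= lo & x <= hi

#find unique positions and create overlap matrix
un<-sort(unique(unlist(df[,2:3])))
cc<-sapply(1:nrow(df), function(i) between(un, df$start[i], df$end[i]))

#partition into non-overlapping sections
groups<-cumsum(c(FALSE,rowSums(cc[-1,]& cc[-nrow(cc),])==0))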

#find the IDs to keep from each section
keeps<-lapply(split.data.frame(cc, groups), function(m) {
    lengths <- colSums(m)
    mx <- which.max(lengths)
    gx <- setdiff(which(lengths>0), mx)
    if(length(gx)>0) {
        if(df$level[mx] > max(df$level[gx])) {
            mx
        } else {
            gx
        }
    } else {
        mx
    }
})
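
Method 4 (open/close listing)

I have one last method, and it is probably the most scalable. We essentially melt the positions and track the start and end events to identify groups. Then we split by group and check whether the longest range in each group also has the highest level. Finally we return the IDs. This method uses only standard base functions.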

#create open/close listing
dd<-rbind(
    cbind(df[,c(1,4)],pos=df[,2], evt=1),
    cbind(df[,c(1,4)],pos=df[,3], evt=-1)
)

#annotate with useful info
dd<-dd[order(dd$pos, -dd$evt),]
dd$open <- cumsum(dd$evt)
dd$group <- cumsum(c(0,head(dd$open,-1)==0))
dd$width <- ave(dd$pos, dd$id, FUN=function(x) diff(range(x)))

#slim down
dd <- subset(dd, evt==1,select=c("id","level","width","group"))

#process each group
ids<-unlist(lapply(split(dd, dd$group), function(x) {
    if(nrow(x)==1) return(x$id)
    mw<-which.max(x$width)
    ml<-which.max(x$level)
    if(mw==ml) {
        return(x$id[mw])
    } else {
        return(x$id[-mw])
    }
}))

By now I think you know what this returns.
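
The grouping step is the core of Method 4: a running sum of +1/-1 events counts how many ranges are currently open, and a new group begins right after that count returns to zero. A toy sketch with three made-up ranges (100-150, 100-350 and 400-450), using only base R:

```r
# events sorted by position, with starts (+1) before ends (-1) at equal positions
ev <- data.frame(pos = c(100, 100, 150, 350, 400, 450),
                 evt = c(  1,   1,  -1,  -1,   1,  -1))

ev$open  <- cumsum(ev$evt)                         # number of ranges currently open
ev$group <- cumsum(c(0, head(ev$open, -1) == 0))   # increment after each return to 0
ev
```

The first four events fall in group 0 and the last two in group 1, because `open` drops to zero at position 350.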

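Summary

So if the real data has IDs of the same kind as the sample data, Method 1 is clearly the better, more direct choice. I am still hoping there is a way to simplify Method 2 that I have just missed. I have not tested the efficiency or performance of any of these methods. My guess is that Method 4 is probably the most efficient, since it should scale linearly.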

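I would take a procedural approach: basically, sort by decreasing level, then for each record remove any subsequent records with a matching id.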

df <- data.frame(id=c("a","a,b,c","d,e","d","h","e","i","b","c"), start=c(100,100,400,400,800,500,900,200,300), end=c(150,350,550,450,850,550,950,250,350), 
                 level = c(1,5,2,3,6,4,2,1,1), stringsAsFactors=FALSE)

#sort
ids <- df[order(df$level, decreasing=TRUE), "id"]

#split
ids <- sapply(ids, strsplit, ",")  #split the sorted ids (not df$id) so the ordering by level is kept

i <- 1

while( i < length(ids)) {
  current <- ids[[i]]

  j <- i + 1
  while(j <= length(ids)) {
    if(any(ids[[j]] %in% current)) 
      ids[[j]] <- NULL
    else 
      j <- j + 1
  }
  i <- i + 1
}

Thinking about this problem makes my head hurt. Could someone please post a 2-line solution using some Hadley function I haven't seen before?

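#Method 2, continued: walk the graph from fromv, choosing at each fork the branch whose maximum level is highest
findPaths <- function(gg, fromv, tov) {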
    if ((missing(tov) && length(incident(gg, fromv, "in"))>1) || 
        (!missing(tov) && V(gg)[fromv]==V(gg)[tov])) {
        return (list(level=0, path=numeric()))
    }
    es <- E(gg)[from(fromv)]
    if (length(es)>1) {
        pp <- lapply(get.edges(gg, es)[,2], function(v) {
            edg <- E(gg)[fromv %--% v]
            lvl <- edg$level
            nxt <- findPaths(gg,v)
            return (list(level=max(lvl, nxt$level), path=c(edg,nxt$path)))
        })
        lvl <- sapply(pp, `[[`, "level")
        take <- pp[[which.max(lvl)]]
        nxt <- findPaths(gg, get.edges(gg, tail(take$path,1))[,2], tov)
        return (list(level=max(take$level, nxt$level), path=c(take$path, nxt$path)))
    } else  {
        lvl <- E(gg)[es]$level
        nv <- get.edges(gg,es)[,2]
        nxt <- findPaths(gg, nv, tov)
        return (list(level=max(lvl, nxt$level), path=c(es, nxt$path)))
    }
}
rr <- findPaths(gg, "100","950")$path
df[df$id %in% na.omit(E(gg)[rr]$id), ]

#      id start end level
# 2 a,b,c   100 350     5
# 4     d   400 450     3
# 5     h   800 850     6
# 6     e   500 550     4
# 7     i   900 950     2

#Method 3: rows kept by the overlap-matrix approach
df[unlist(keeps),]

#Method 4: rows whose ids survived the open/close grouping
df[df$id %in% ids, ]

#Procedural approach: turn the surviving ids into a data.frame and merge back onto df
R> ids <- data.frame(id=names(ids), stringsAsFactors=FALSE)

R> merge(ids, df, sort=FALSE)
     id start end level
1     h   800 850     6
2 a,b,c   100 350     5
3     e   500 550     4
4     d   400 450     3
5     i   900 950     2