R: remove duplicates, but keep the most complete iteration
I am trying to figure out how to remove duplicates based on three variables (id, key, and num). I want to remove the duplicate that has the fewest columns filled in. If the number of filled-in columns is equal, either one can be removed.
For example:
Original <- data.frame(id= c(1,2,2,3,3,4,5,5),
key=c(1,2,2,3,3,4,5,5),
num=c(1,1,1,1,1,1,1,1),
v4= c(1,NA,5,5,NA,5,NA,7),
v5=c(1,NA,5,5,NA,5,NA,7))
You can aggregate the data and select the row with the highest score:
Original <- data.frame(id= c(1,2,2,3,3,4,5,5),
key=c(1,2,2,3,3,4,5,5),
num=c(1,1,1,1,1,1,1,1),
v4= c(1,NA,5,5,NA,5,NA,7),
v5=c(1,NA,5,5,NA,5,NA,7))
# Flag every non-missing cell (grepl() yields FALSE for NA)
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
# Get the score: the number of filled-in cells per row
Original$present <- rowSums(Present)
# Create a column to aggregate on
Original$id.key.num <- paste(Original$id, Original$key, Original$num, sep = "-")
library("plyr")
# Aggregate here: the highest score within each id/key/num group
Final <- ddply(Original, .(id.key.num), summarize,
               Max = max(present))
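The aggregation above returns only the best score per group, not the rows themselves. As a minimal base-R sketch of the same idea (score each row, then keep the top-scoring row per id/key/num group), something like the following could work. The names `score`, `group`, `is_best`, and `Kept` are illustrative and not part of the original answer, and `!is.na()` is swapped in for the `grepl()` test as the completeness check.

```r
# Example data from the question
Original <- data.frame(id  = c(1, 2, 2, 3, 3, 4, 5, 5),
                       key = c(1, 2, 2, 3, 3, 4, 5, 5),
                       num = c(1, 1, 1, 1, 1, 1, 1, 1),
                       v4  = c(1, NA, 5, 5, NA, 5, NA, 7),
                       v5  = c(1, NA, 5, 5, NA, 5, NA, 7))

# Score each row by its number of non-missing cells (illustrative name)
score <- rowSums(!is.na(Original))

# Group label built from the three key columns, as in the answer above
group <- paste(Original$id, Original$key, Original$num, sep = "-")

# Within each group, flag the first row that attains the maximum score.
# ave() stores the logical result as 0/1 because `score` is numeric.
is_best <- ave(score, group,
               FUN = function(s) seq_along(s) == which.max(s)) == 1

# Keep only the flagged rows, with all of their original columns
Kept <- Original[is_best, ]
print(Kept)
```

Because `which.max()` picks the first maximum, a tie inside a group keeps the earlier row, which fits the requirement that either duplicate may be dropped when they are equally complete.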
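Here is a solution. It is not very pretty, but it should work for your application:
# Completeness score per row; Present is the logical matrix built in the first answer
CompleteNess <- rowSums(Present)
# Inspect the scores next to the data
cbind(Original, CompleteNess)
# Order by the degree of completeness (least complete first)
Original <- Original[order(CompleteNess), ]
# Starting from the bottom, select the non-duplicated rows
# based on the first 3 columns
Original[!duplicated(Original[, 1:3], fromLast = TRUE), ]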
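The order-then-`duplicated()` approach can be wrapped so it does not overwrite `Original` and does not depend on objects from another answer. The following is only a sketch under assumptions: `keep_most_complete` is a hypothetical helper name, its `keys` argument is an illustrative parameter, and completeness is measured with `!is.na()` over all columns.

```r
# Example data from the question
Original <- data.frame(id  = c(1, 2, 2, 3, 3, 4, 5, 5),
                       key = c(1, 2, 2, 3, 3, 4, 5, 5),
                       num = c(1, 1, 1, 1, 1, 1, 1, 1),
                       v4  = c(1, NA, 5, 5, NA, 5, NA, 7),
                       v5  = c(1, NA, 5, 5, NA, 5, NA, 7))

# Hypothetical helper: for each combination of the `keys` columns,
# keep one row with the largest number of non-missing cells.
keep_most_complete <- function(df, keys) {
  completeness <- rowSums(!is.na(df))
  # Least complete rows first, most complete rows last
  sorted <- df[order(completeness), , drop = FALSE]
  # Scanning from the bottom, the last occurrence of each key
  # combination is the most complete one, so keep only that
  is_dup <- duplicated(sorted[, keys, drop = FALSE], fromLast = TRUE)
  sorted[!is_dup, , drop = FALSE]
}

Final <- keep_most_complete(Original, c("id", "key", "num"))
# Restore a readable order by the key columns
Final <- Final[order(Final$id, Final$key, Final$num), ]
print(Final)
```

Sorting ascending and passing `fromLast = TRUE` mirrors the answer above: every less complete duplicate appears earlier in the sorted data, so it is the one marked as a duplicate.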
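To carry the v4 and v5 values of the most complete row into the result, the ddply() call from the first answer can be extended:
# Requires library("plyr") and the present / id.key.num columns created earlier
Final <- ddply(Original, .(id.key.num), summarize,
               Max = max(present),
               v4 = v4[which.max(present)],
               v5 = v5[which.max(present)]
)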
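The `v4[which.max(present)]` idiom relies on `which.max()` returning the position of the first maximum. A tiny illustration with made-up values for a single group (the vectors below are hypothetical, not taken from the question's data):

```r
# Hypothetical scores and values for the three rows of one group
present <- c(3, 5, 5)
v4      <- c(NA, 5, 7)

# Position of the first maximum score; rows 2 and 3 tie, row 2 wins
best <- which.max(present)
print(best)

# Value of v4 taken from that row
print(v4[best])
```

When two rows are equally complete, the earlier one supplies the values, which is acceptable here because the question allows either duplicate to be removed in that case.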