Warning: file_get_contents(/data/phpspider/zhask/data//catemap/4/r/78.json): failed to open stream: No such file or directory in /data/phpspider/zhask/libs/function.php on line 167

Warning: Invalid argument supplied for foreach() in /data/phpspider/zhask/libs/tag.function.php on line 1116

Notice: Undefined index: in /data/phpspider/zhask/libs/function.php on line 180

Warning: array_chunk() expects parameter 1 to be array, null given in /data/phpspider/zhask/libs/function.php on line 181
R 删除重复项,但保留最完整的迭代_R_Duplicates - Fatal编程技术网

R 删除重复项,但保留最完整的迭代

R 删除重复项,但保留最完整的迭代,r,duplicates,R,Duplicates,我试图找出如何基于三个变量(id、key和num)删除重复项。我想删除的列填充量最少的重复。如果填充的数字相等,则可以删除其中任何一个。 比如说, Original <- data.frame(id= c(1,2,2,3,3,4,5,5), key=c(1,2,2,3,3,4,5,5), num=c(1,1,1,1,1,1,1,1), v4= c(1,NA,5,5,NA,5,NA,7), v5=c(1,NA,5,5,NA,5,NA,7)) Original您可以聚合数据并选择具有最大

我试图找出如何基于三个变量(
id、key和num
)删除重复项。我想删除的列填充量最少的重复。如果填充的数字相等,则可以删除其中任何一个。 比如说,

Original <- data.frame(id= c(1,2,2,3,3,4,5,5), 
key=c(1,2,2,3,3,4,5,5),
num=c(1,1,1,1,1,1,1,1),
v4= c(1,NA,5,5,NA,5,NA,7), 
v5=c(1,NA,5,5,NA,5,NA,7))

Original您可以聚合数据并选择具有最大分数的行:

Original <- data.frame(id= c(1,2,2,3,3,4,5,5), 
                       key=c(1,2,2,3,3,4,5,5),
                       num=c(1,1,1,1,1,1,1,1),
                       v4= c(1,NA,5,5,NA,5,NA,7), 
                       v5=c(1,NA,5,5,NA,5,NA,7))
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))

#get the score 
Original$present <- rowSums(Present)

#create a column to aggregate on
Original$id.key.num <- paste(Original$id, Original$key, Original$num, sep = "-")

library("plyr")
#aggregate here
Final <- ddply(Original,.(id.key.num),summarize,
      Max = max(present))

Original这里有一个解决方案。它不是很漂亮,但应该适用于您的应用程序:

#Order by the degree of completeness    
Original<-Original[order(CompleteNess),]

#Starting from the bottom select the not duplicated rows 
#based on the first 3 columns
Original[!duplicated(Original[,1:3], fromLast = TRUE),]
#按完整程度排序
起初的
CompleteNess <- rowSums(Present)
cbind(Original, CompleteNess)
Original <- data.frame(id= c(1,2,2,3,3,4,5,5), 
                       key=c(1,2,2,3,3,4,5,5),
                       num=c(1,1,1,1,1,1,1,1),
                       v4= c(1,NA,5,5,NA,5,NA,7), 
                       v5=c(1,NA,5,5,NA,5,NA,7))
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))

#get the score 
Original$present <- rowSums(Present)

#create a column to aggregate on
Original$id.key.num <- paste(Original$id, Original$key, Original$num, sep = "-")

library("plyr")
#aggregate here
Final <- ddply(Original,.(id.key.num),summarize,
      Max = max(present))
Final <- ddply(Original,.(id.key.num),summarize,
      Max = max(present),
      v4 = v4[which.max(present)],
      v5 = v5[which.max(present)]
      )
#Order by the degree of completeness    
Original<-Original[order(CompleteNess),]

#Starting from the bottom select the not duplicated rows 
#based on the first 3 columns
Original[!duplicated(Original[,1:3], fromLast = TRUE),]