Removing duplicates from a data frame in R

r dataframe

I have this data:

UserID   Quiz_answers            Quiz_Date       
  1     `a1,a2,a3`Positive       26-01-2017        
  1     `a1,a4,a3`Positive       26-01-2017        
  1     `a1,a2,a4`Negative       28-02-2017        
  1     `a1,a2,a3`Neutral        30-10-2017        
  1     `a1,a2,a4`Positive       30-11-2017        
  1     `a1,a2,a4`Negative       28-02-2018    

  2     `a1,a2,a3`Negative       27-01-2017            
  2     `a1,a7,a3`Neutral        28-08-2017        
  2     `a1,a2,a5`Negative       28-01-2017

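I want to remove the duplicate rows. A row counts as a duplicate under these rules:

The word that follows the closing backtick (`) in the Quiz_answers column is the same.

For such rows, if the values in the UserID and Quiz_Date columns are also the same, the row is a duplicate.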

UserID <- c(1,1,1,1,1,1,2,2,2)
Quiz_answers <- c("`a1,a2,a3`Positive","`a1,a4,a3`Positive","`a1,a2,a4`Negative","`a1,a2,a3`Neutral","`a1,a2,a4`Positive","`a1,a2,a4`Negative","`a1,a2,a3`Negative","`a1,a7,a3`Neutral","`a1,a2,a5`Negative")
Quiz_Date <- as.Date(c("26-01-2017","26-01-2017","28-02-2017","30-10-2017","30-11-2017","28-02-2018","27-01-2017","28-08-2017","28-01-2017"),'%d-%m-%Y')
data <- data.frame(UserID, Quiz_answers, Quiz_Date)

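According to these rules, only the second row should be removed from the data frame, because it is the only row that meets the duplicate condition. What am I doing wrong?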
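As a quick check of the two rules, here is a minimal base R sketch that extracts the word after the last backtick, combines it with UserID and Quiz_Date, and lets duplicated() flag repeats. The names label, key and is_dup are illustrative and do not come from the original post.

```r
# Rebuild the example data from the question.
data <- data.frame(
  UserID = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Quiz_answers = c("`a1,a2,a3`Positive", "`a1,a4,a3`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Neutral", "`a1,a2,a4`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Negative", "`a1,a7,a3`Neutral", "`a1,a2,a5`Negative"),
  Quiz_Date = as.Date(c("26-01-2017", "26-01-2017", "28-02-2017", "30-10-2017",
                        "30-11-2017", "28-02-2018", "27-01-2017", "28-08-2017",
                        "28-01-2017"), "%d-%m-%Y"),
  stringsAsFactors = FALSE
)

# Rule 1: the word after the closing backtick.
# The greedy ".*`" removes everything up to and including the last backtick.
label <- sub(".*`", "", data$Quiz_answers)

# Rule 2: UserID and Quiz_Date must match as well, so build a three-column key.
key <- data.frame(UserID = data$UserID, Quiz_Date = data$Quiz_Date, label = label)

# duplicated() marks every repeat of a key after its first occurrence.
is_dup <- duplicated(key)
which(is_dup)  # 2
```

On the example data this flags only row 2, which matches the expectation stated in the question.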

Try this.

Your data:

df <- read.table(text="UserID   Quiz_answers            Quiz_Date       
1     `a1,a2,a3`Positive       26-01-2017        
1     `a1,a4,a3`Positive       26-01-2017        
1     `a1,a2,a4`Negative       28-02-2017        
1     `a1,a2,a3`Neutral        30-10-2017        
1     `a1,a2,a4`Positive       30-11-2017        
1     `a1,a2,a4`Negative       28-02-2018    
2     `a1,a2,a3`Negative       27-01-2017            
2     `a1,a7,a3`Neutral        28-08-2017        
2     `a1,a2,a5`Negative       28-01-2017", header = TRUE, stringsAsFactors=FALSE)

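library(dplyr)
ans <- df %>%
        mutate(grp = sub(".*`(\\D+)$", "\\1", Quiz_answers)) %>%
        group_by(grp, UserID, Quiz_Date) %>%
        slice(1) %>%
        ungroup() %>%
        select(-grp) %>%
        arrange(UserID, Quiz_Date)

# A tibble: 8 x 3
#   UserID       Quiz_answers  Quiz_Date
#    <int>              <chr>      <chr>
# 1      1 `a1,a2,a3`Positive 26-01-2017
# 2      1 `a1,a2,a4`Negative 28-02-2017
# 3      1 `a1,a2,a4`Negative 28-02-2018
# 4      1  `a1,a2,a3`Neutral 30-10-2017
# 5      1 `a1,a2,a4`Positive 30-11-2017
# 6      2 `a1,a2,a3`Negative 27-01-2017
# 7      2 `a1,a2,a5`Negative 28-01-2017
# 8      2  `a1,a7,a3`Neutral 28-08-2017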

You can use the sqldf package as follows. First, find the Positive, Negative and Neutral groups. Then use GROUP BY to filter out the duplicates:

require("sqldf")
result <- sqldf("SELECT * FROM df WHERE Quiz_answers LIKE '%`Positive' GROUP BY UserID, Quiz_Date 
       UNION 
       SELECT * FROM df WHERE Quiz_answers LIKE '%`Negative' GROUP BY UserID, Quiz_Date 
       UNION 
       SELECT * FROM df WHERE Quiz_answers LIKE '%`Neutral' GROUP BY UserID, Quiz_Date")

Here is a two-line solution using only base R:

data[,"group"] <- with(data, sub(".*`", "", Quiz_answers))

data <- data[as.integer(rownames(unique(data[, !(names(data) %in% "Quiz_answers")   ]))), !(names(data) %in% "group")]
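A variation on the base R approach, offered as a sketch rather than as part of the original answer: applying duplicated() to a key of UserID, Quiz_Date and the extracted label avoids going through row names, and its fromLast argument chooses whether the first or the last row of each duplicate set survives. The names key, keep_first and keep_last are illustrative.

```r
# Example data with dates kept as character, as in the read.table() version.
df <- data.frame(
  UserID = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Quiz_answers = c("`a1,a2,a3`Positive", "`a1,a4,a3`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Neutral", "`a1,a2,a4`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Negative", "`a1,a7,a3`Neutral", "`a1,a2,a5`Negative"),
  Quiz_Date = c("26-01-2017", "26-01-2017", "28-02-2017", "30-10-2017", "30-11-2017",
                "28-02-2018", "27-01-2017", "28-08-2017", "28-01-2017"),
  stringsAsFactors = FALSE
)

# Key columns that define a duplicate: user, date, and the word after the backtick.
key <- data.frame(
  UserID = df$UserID,
  Quiz_Date = df$Quiz_Date,
  label = sub(".*`", "", df$Quiz_answers)
)

# Keep the first row of each duplicate set.
keep_first <- df[!duplicated(key), ]

# Keep the last row of each duplicate set instead.
keep_last <- df[!duplicated(key, fromLast = TRUE), ]

keep_first$Quiz_answers[1]  # "`a1,a2,a3`Positive"
keep_last$Quiz_answers[1]   # "`a1,a4,a3`Positive"
```

Both results have 8 rows; they differ only in which of the two 26-01-2017 Positive rows is retained.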

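data[!(duplicated(data[-2]) & duplicated(gsub('`.*`', '', data$Quiz_answers))), ]

For simplicity, I did not add more conditions to my question. In fact, there is a third rule for duplicates: the date difference should be greater than 1 day, so grouping does not work for me. Sorry for not including it in the question.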
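The third rule mentioned in the comment is not fully specified, so the following is only a sketch under an assumed reading: within the same UserID and label, a row is treated as a duplicate when its Quiz_Date lies no more than max_gap_days after the last kept row. The function flag_near_duplicates and its parameter max_gap_days are hypothetical names introduced here, not something from the original discussion.

```r
data <- data.frame(
  UserID = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Quiz_answers = c("`a1,a2,a3`Positive", "`a1,a4,a3`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Neutral", "`a1,a2,a4`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Negative", "`a1,a7,a3`Neutral", "`a1,a2,a5`Negative"),
  Quiz_Date = as.Date(c("26-01-2017", "26-01-2017", "28-02-2017", "30-10-2017",
                        "30-11-2017", "28-02-2018", "27-01-2017", "28-08-2017",
                        "28-01-2017"), "%d-%m-%Y"),
  stringsAsFactors = FALSE
)

# Hypothetical helper implementing the ASSUMED rule described above.
flag_near_duplicates <- function(df, max_gap_days = 1) {
  label <- sub(".*`", "", df$Quiz_answers)          # word after the last backtick
  group <- paste(df$UserID, label, sep = "\r")      # one group per user and label
  is_dup <- logical(nrow(df))
  for (idx in split(seq_len(nrow(df)), group)) {
    idx <- idx[order(df$Quiz_Date[idx])]            # visit each group in date order
    last_kept <- NULL
    for (i in idx) {
      d <- df$Quiz_Date[i]
      gap_ok <- is.null(last_kept) ||
        as.numeric(d - last_kept, units = "days") > max_gap_days
      if (gap_ok) {
        last_kept <- d                              # far enough apart: keep this row
      } else {
        is_dup[i] <- TRUE                           # too close to the last kept row
      }
    }
  }
  is_dup
}

deduped <- data[!flag_near_duplicates(data, max_gap_days = 1), ]
```

Under this assumed reading with max_gap_days = 1, row 9 is flagged along with row 2, because it falls one day after row 7 for the same user and label; with max_gap_days = 0 only row 2 is flagged.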
Output of the sqldf query:

  UserID       Quiz_answers  Quiz_Date
1      1  `a1,a2,a3`Neutral 30-10-2017
2      1 `a1,a2,a4`Negative 28-02-2017
3      1 `a1,a2,a4`Negative 28-02-2018
4      1 `a1,a2,a4`Positive 30-11-2017
5      1 `a1,a4,a3`Positive 26-01-2017
6      2 `a1,a2,a3`Negative 27-01-2017
7      2 `a1,a2,a5`Negative 28-01-2017
8      2  `a1,a7,a3`Neutral 28-08-2017