Remove duplicates from a data frame in R


I have this data:

UserID   Quiz_answers            Quiz_Date       
  1     `a1,a2,a3`Positive       26-01-2017        
  1     `a1,a4,a3`Positive       26-01-2017        
  1     `a1,a2,a4`Negative       28-02-2017        
  1     `a1,a2,a3`Neutral        30-10-2017        
  1     `a1,a2,a4`Positive       30-11-2017        
  1     `a1,a2,a4`Negative       28-02-2018    

  2     `a1,a2,a3`Negative       27-01-2017            
  2     `a1,a7,a3`Neutral        28-08-2017        
  2     `a1,a2,a5`Negative       28-01-2017  
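
I want to remove the duplicate rows. A row counts as a duplicate when both of these hold:

  • The word after the closing backtick (`) in the Quiz_answers column (Positive, Negative, or Neutral) is the same as in another row.
  • The UserID and Quiz_Date values of those rows are also the same.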

    UserID <- c(1, 1, 1, 1, 1, 1, 2, 2, 2)
    Quiz_answers <- c("`a1,a2,a3`Positive", "`a1,a4,a3`Positive", "`a1,a2,a4`Negative", "`a1,a2,a3`Neutral", "`a1,a2,a4`Positive", "`a1,a2,a4`Negative", "`a1,a2,a3`Negative", "`a1,a7,a3`Neutral", "`a1,a2,a5`Negative")
    Quiz_Date <- as.Date(c("26-01-2017", "26-01-2017", "28-02-2017", "30-10-2017", "30-11-2017", "28-02-2018", "27-01-2017", "28-08-2017", "28-01-2017"), format = "%d-%m-%Y")
    data <- data.frame(UserID, Quiz_answers, Quiz_Date)
    
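According to these rules, only the second row should be removed from the data frame, because it is the only row that meets the duplicate condition. What am I doing wrong?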

Try this with dplyr.

Your data:

    df <- read.table(text="UserID   Quiz_answers            Quiz_Date       
    1     `a1,a2,a3`Positive       26-01-2017        
    1     `a1,a4,a3`Positive       26-01-2017        
    1     `a1,a2,a4`Negative       28-02-2017        
    1     `a1,a2,a3`Neutral        30-10-2017        
    1     `a1,a2,a4`Positive       30-11-2017        
    1     `a1,a2,a4`Negative       28-02-2018    
    2     `a1,a2,a3`Negative       27-01-2017            
    2     `a1,a7,a3`Neutral        28-08-2017        
    2     `a1,a2,a5`Negative       28-01-2017", header = TRUE, stringsAsFactors=FALSE)
    
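    library(dplyr)

    ans <- df %>%
      # grp is the label after the last backtick (Positive, Negative, or Neutral)
      mutate(grp = sub(".*`(\\D+)$", "\\1", Quiz_answers)) %>%
      group_by(grp, UserID, Quiz_Date) %>%
      slice(1) %>%        # keep the first row of each (grp, UserID, Quiz_Date) group
      ungroup() %>%
      select(-grp) %>%
      arrange(UserID, Quiz_Date)

    # A tibble: 8 x 3
    #   UserID       Quiz_answers  Quiz_Date
    #    <int>              <chr>      <chr>
    # 1      1 `a1,a2,a3`Positive 26-01-2017
    # 2      1 `a1,a2,a4`Negative 28-02-2017
    # 3      1 `a1,a2,a4`Negative 28-02-2018
    # 4      1  `a1,a2,a3`Neutral 30-10-2017
    # 5      1 `a1,a2,a4`Positive 30-11-2017
    # 6      2 `a1,a2,a3`Negative 27-01-2017
    # 7      2 `a1,a2,a5`Negative 28-01-2017
    # 8      2  `a1,a7,a3`Neutral 28-08-2017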
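
The grouping key in the dplyr answer comes entirely from the regular expression passed to `sub()`. As a minimal sketch of what that call extracts (the vector name `x` is only illustrative, and the strings are copied from the question):

```r
# Example strings in the same shape as the Quiz_answers column.
x <- c("`a1,a2,a3`Positive", "`a1,a4,a3`Positive",
       "`a1,a2,a4`Negative", "`a1,a2,a3`Neutral")

# ".*`" greedily matches everything up to the last backtick;
# "(\\D+)$" captures the trailing run of non-digit characters,
# and "\\1" replaces the whole string with that captured label.
grp <- sub(".*`(\\D+)$", "\\1", x)

print(grp)
```

Rows that share this label as well as UserID and Quiz_Date fall into the same group, which is why `slice(1)` leaves one row per group.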
    
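You can use the sqldf package as shown below. First select the Positive, Negative, and Neutral groups separately, then use GROUP BY to filter out the duplicates within each group: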

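    library(sqldf)

    # One SELECT per label. GROUP BY UserID, Quiz_Date returns a single row per
    # user and date within that label (SQLite decides which row of the group is
    # returned), and UNION combines the three results.
    result <- sqldf("SELECT * FROM df WHERE Quiz_answers LIKE '%`Positive' GROUP BY UserID, Quiz_Date 
           UNION 
           SELECT * FROM df WHERE Quiz_answers LIKE '%`Negative' GROUP BY UserID, Quiz_Date 
           UNION 
           SELECT * FROM df WHERE Quiz_answers LIKE '%`Neutral' GROUP BY UserID, Quiz_Date")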
    

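Here is a two-line solution that uses only base R:

    # Line 1: store the label after the last backtick in a helper column
    data[, "group"] <- with(data, sub(".*`", "", Quiz_answers))

    # Line 2: keep the rows that are unique on every column except Quiz_answers,
    # then drop the helper column again
    data <- data[as.integer(rownames(unique(data[, !(names(data) %in% "Quiz_answers")]))), !(names(data) %in% "group")]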
    

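    # data[-2] drops Quiz_answers, leaving UserID and Quiz_Date;
    # gsub() strips the backtick-quoted part, leaving only the label
    data[!(duplicated(data[-2]) & duplicated(gsub('`.*`', '', data$Quiz_answers))), ]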
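
A caveat on combining two separate `duplicated()` calls with `&`: each call looks for an earlier match on its own columns, so a row could be flagged when its UserID and Quiz_Date match one earlier row while its label matches a different one. On the sample data this does not happen, but a sketch that tests the combined key in base R could look like this (`label`, `is_dup` and `deduped` are illustrative names, not from the answers):

```r
# Sample data copied from the question.
data <- data.frame(
  UserID = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Quiz_answers = c("`a1,a2,a3`Positive", "`a1,a4,a3`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Neutral", "`a1,a2,a4`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Negative", "`a1,a7,a3`Neutral", "`a1,a2,a5`Negative"),
  Quiz_Date = as.Date(c("26-01-2017", "26-01-2017", "28-02-2017", "30-10-2017",
                        "30-11-2017", "28-02-2018", "27-01-2017", "28-08-2017",
                        "28-01-2017"), format = "%d-%m-%Y"),
  stringsAsFactors = FALSE
)

# Label after the last backtick: Positive, Negative, or Neutral.
label <- sub(".*`", "", data$Quiz_answers)

# A row is a duplicate only when UserID, Quiz_Date and label together
# match an earlier row.
is_dup <- duplicated(data.frame(data$UserID, data$Quiz_Date, label))

deduped <- data[!is_dup, ]
print(which(is_dup))  # 2
print(nrow(deduped))  # 8
```

Only the second row is flagged, which matches the result the question asks for.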
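
For simplicity I left some conditions out of my question. There is actually a third rule for duplicates: the date difference should be greater than 1 day, so grouping does not work for me. Sorry for not including that in the question.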
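
The comment does not spell out the third rule precisely, so the following is only a sketch under one assumed reading: within the same UserID and label, a row is treated as a duplicate when its date is no more than `max_gap_days` days after the previously kept row. The function name `drop_near_duplicates` and the parameter `max_gap_days` are hypothetical and not part of any answer above.

```r
# Sample data copied from the question.
data <- data.frame(
  UserID = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Quiz_answers = c("`a1,a2,a3`Positive", "`a1,a4,a3`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Neutral", "`a1,a2,a4`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Negative", "`a1,a7,a3`Neutral", "`a1,a2,a5`Negative"),
  Quiz_Date = as.Date(c("26-01-2017", "26-01-2017", "28-02-2017", "30-10-2017",
                        "30-11-2017", "28-02-2018", "27-01-2017", "28-08-2017",
                        "28-01-2017"), format = "%d-%m-%Y"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: ASSUMES a row is a duplicate when it has the same
# UserID and label as the previously kept row and lies at most
# max_gap_days days after it.
drop_near_duplicates <- function(data, max_gap_days = 1) {
  label <- sub(".*`", "", data$Quiz_answers)
  keep <- rep(TRUE, nrow(data))
  # Row indices for each (UserID, label) combination that actually occurs.
  groups <- split(seq_len(nrow(data)), list(data$UserID, label), drop = TRUE)
  for (idx in groups) {
    idx <- idx[order(data$Quiz_Date[idx])]  # walk each group in date order
    last_kept <- idx[1]
    for (i in idx[-1]) {
      gap <- as.numeric(data$Quiz_Date[i] - data$Quiz_Date[last_kept])
      if (gap <= max_gap_days) {
        keep[i] <- FALSE   # too close to the last kept row: drop it
      } else {
        last_kept <- i     # far enough apart: keep it and compare against it next
      }
    }
  }
  data[keep, ]
}

result <- drop_near_duplicates(data, max_gap_days = 1)
print(result)
```

Under this reading, the sample data loses row 2 (same day as row 1) and row 9 (one day after row 7). If the intended rule is different, only the comparison `gap <= max_gap_days` should need to change.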
    
Output of the sqldf query shown earlier:
    
      UserID       Quiz_answers  Quiz_Date
    1      1  `a1,a2,a3`Neutral 30-10-2017
    2      1 `a1,a2,a4`Negative 28-02-2017
    3      1 `a1,a2,a4`Negative 28-02-2018
    4      1 `a1,a2,a4`Positive 30-11-2017
    5      1 `a1,a4,a3`Positive 26-01-2017
    6      2 `a1,a2,a3`Negative 27-01-2017
    7      2 `a1,a2,a5`Negative 28-01-2017
    8      2  `a1,a7,a3`Neutral 28-08-2017
    