Remove duplicates from a data frame in R
I have this data:
UserID Quiz_answers Quiz_Date
1 `a1,a2,a3`Positive 26-01-2017
1 `a1,a4,a3`Positive 26-01-2017
1 `a1,a2,a4`Negative 28-02-2017
1 `a1,a2,a3`Neutral 30-10-2017
1 `a1,a2,a4`Positive 30-11-2017
1 `a1,a2,a4`Negative 28-02-2018
2 `a1,a2,a3`Negative 27-01-2017
2 `a1,a7,a3`Neutral 28-08-2017
2 `a1,a2,a5`Negative 28-01-2017
I want to remove the duplicate rows. The rules for a duplicate row include:
UserID <- c(1, 1, 1, 1, 1, 1, 2, 2, 2)
Quiz_answers <- c("`a1,a2,a3`Positive", "`a1,a4,a3`Positive", "`a1,a2,a4`Negative", "`a1,a2,a3`Neutral", "`a1,a2,a4`Positive", "`a1,a2,a4`Negative", "`a1,a2,a3`Negative", "`a1,a7,a3`Neutral", "`a1,a2,a5`Negative")
Quiz_Date <- as.Date(c("26-01-2017", "26-01-2017", "28-02-2017", "30-10-2017", "30-11-2017", "28-02-2018", "27-01-2017", "28-08-2017", "28-01-2017"), format = "%d-%m-%Y")
data <- data.frame(UserID, Quiz_answers, Quiz_Date)
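According to the rules, only the second row should be removed from the data frame; it is the only row that meets the duplicate conditions.
What am I doing wrong?
Try this.
Your data: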
df <- read.table(text="UserID Quiz_answers Quiz_Date
1 `a1,a2,a3`Positive 26-01-2017
1 `a1,a4,a3`Positive 26-01-2017
1 `a1,a2,a4`Negative 28-02-2017
1 `a1,a2,a3`Neutral 30-10-2017
1 `a1,a2,a4`Positive 30-11-2017
1 `a1,a2,a4`Negative 28-02-2018
2 `a1,a2,a3`Negative 27-01-2017
2 `a1,a7,a3`Neutral 28-08-2017
2 `a1,a2,a5`Negative 28-01-2017", header = TRUE, stringsAsFactors=FALSE)
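library(dplyr)
ans <- df %>%
  # extract the label that follows the closing backtick (Positive/Negative/Neutral)
  mutate(grp = sub(".*`(\\D+)$", "\\1", Quiz_answers)) %>%
  group_by(grp, UserID, Quiz_Date) %>%
  slice(1) %>%   # keep the first row of each group
  ungroup() %>%
  select(-grp) %>%
  arrange(UserID, Quiz_Date)
ans
# A tibble: 8 x 3
#   UserID Quiz_answers       Quiz_Date
#    <int> <chr>              <chr>
# 1      1 `a1,a2,a3`Positive 26-01-2017
# 2      1 `a1,a2,a4`Negative 28-02-2017
# 3      1 `a1,a2,a4`Negative 28-02-2018
# 4      1 `a1,a2,a3`Neutral  30-10-2017
# 5      1 `a1,a2,a4`Positive 30-11-2017
# 6      2 `a1,a2,a3`Negative 27-01-2017
# 7      2 `a1,a2,a5`Negative 28-01-2017
# 8      2 `a1,a7,a3`Neutral  28-08-2017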
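As a variation on the pipeline above, the group/slice/ungroup steps could probably be collapsed into a single `distinct()` call. This is only a sketch: the data frame is rebuilt inline so the block runs on its own, and the names `grp` and `deduped` are illustrative.

```r
library(dplyr)

# Same data as in the question, rebuilt so the example is self-contained
df <- data.frame(
  UserID = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Quiz_answers = c("`a1,a2,a3`Positive", "`a1,a4,a3`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Neutral", "`a1,a2,a4`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Negative", "`a1,a7,a3`Neutral", "`a1,a2,a5`Negative"),
  Quiz_Date = c("26-01-2017", "26-01-2017", "28-02-2017", "30-10-2017", "30-11-2017",
                "28-02-2018", "27-01-2017", "28-08-2017", "28-01-2017"),
  stringsAsFactors = FALSE
)

deduped <- df %>%
  # label after the last backtick: Positive, Negative or Neutral
  mutate(grp = sub(".*`", "", Quiz_answers)) %>%
  # keep the first row for each UserID / Quiz_Date / label combination
  distinct(UserID, Quiz_Date, grp, .keep_all = TRUE) %>%
  select(-grp)

deduped
```

Unlike the `arrange()` version, this keeps the rows in their original order.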
You can use the sqldf package as follows. First, find the Positive, Negative and Neutral groups. Then filter out the duplicates with GROUP BY:
require("sqldf")
result <- sqldf("SELECT * FROM df WHERE Quiz_answers LIKE '%`Positive' GROUP BY UserID, Quiz_Date
UNION
SELECT * FROM df WHERE Quiz_answers LIKE '%`Negative' GROUP BY UserID, Quiz_Date
UNION
SELECT * FROM df WHERE Quiz_answers LIKE '%`Neutral' GROUP BY UserID, Quiz_Date")
Here is a two-line solution that uses only base R:
data[,"group"] <- with(data, sub(".*`", "", Quiz_answers))
data <- data[as.integer(rownames(unique(data[, !(names(data) %in% "Quiz_answers") ]))), !(names(data) %in% "group")]
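The base R approach above relies on the data frame having default row names. A minimal sketch of the same idea with `duplicated()` on an explicit key avoids that dependency. It assumes a duplicate means the same UserID, Quiz_Date and label; the names `sentiment`, `key` and `deduped` are illustrative.

```r
# Same data as in the question, rebuilt so the example is self-contained
data <- data.frame(
  UserID = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Quiz_answers = c("`a1,a2,a3`Positive", "`a1,a4,a3`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Neutral", "`a1,a2,a4`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Negative", "`a1,a7,a3`Neutral", "`a1,a2,a5`Negative"),
  Quiz_Date = as.Date(c("26-01-2017", "26-01-2017", "28-02-2017", "30-10-2017",
                        "30-11-2017", "28-02-2018", "27-01-2017", "28-08-2017",
                        "28-01-2017"), format = "%d-%m-%Y"),
  stringsAsFactors = FALSE
)

# Label after the last backtick: Positive, Negative or Neutral
sentiment <- sub(".*`", "", data$Quiz_answers)

# One string per row describing what counts as "the same" row (assumed rule)
key <- paste(data$UserID, data$Quiz_Date, sentiment, sep = "|")

# duplicated() flags every repeat of a key after its first occurrence
deduped <- data[!duplicated(key), ]
deduped
```

Because the key is built separately, no helper column has to be added to `data` and dropped again afterwards.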
Another base R one-liner:
data[!(duplicated(data[-2]) & duplicated(gsub('`.*`', '', data$Quiz_answers))), ]
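For simplicity, I did not add further conditions to my question. In fact, the third rule for a duplicate is that the date difference should be greater than 1 day, so grouping does not work for me. Sorry for not including that in the question.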
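If the duplicate rule depends on the distance between dates rather than on exact equality, one possible starting point is to compute the gap in days to the previous row of the same user and label, and then filter on it. The exact date rule is not spelled out in this thread, so the threshold `max_gap` and the direction of the comparison below are assumptions to adjust, and the column names `sentiment` and `gap_days` are illustrative.

```r
# Same data as in the question, rebuilt so the example is self-contained
data <- data.frame(
  UserID = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Quiz_answers = c("`a1,a2,a3`Positive", "`a1,a4,a3`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Neutral", "`a1,a2,a4`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Negative", "`a1,a7,a3`Neutral", "`a1,a2,a5`Negative"),
  Quiz_Date = as.Date(c("26-01-2017", "26-01-2017", "28-02-2017", "30-10-2017",
                        "30-11-2017", "28-02-2018", "27-01-2017", "28-08-2017",
                        "28-01-2017"), format = "%d-%m-%Y"),
  stringsAsFactors = FALSE
)

# Label after the last backtick: Positive, Negative or Neutral
data$sentiment <- sub(".*`", "", data$Quiz_answers)

# Sort so that rows of the same user and label are adjacent and in date order
data <- data[order(data$UserID, data$sentiment, data$Quiz_Date), ]

# Days since the previous row of the same user and label (NA for the first one)
data$gap_days <- ave(as.numeric(data$Quiz_Date), data$UserID, data$sentiment,
                     FUN = function(d) c(NA, diff(d)))

# ASSUMPTION: a row is a duplicate when it follows the previous row of its group
# by at most max_gap days; 0 means "same day". Change this to match the real rule.
max_gap <- 0
is_dup <- !is.na(data$gap_days) & data$gap_days <= max_gap

deduped <- data[!is_dup, c("UserID", "Quiz_answers", "Quiz_Date")]
deduped
```

Note that each row is compared with the previous row in its group, not with the previous row that was kept, so a long chain of closely spaced rows would collapse to its first element.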
The sqldf query above returns:
UserID Quiz_answers Quiz_Date
1 1 `a1,a2,a3`Neutral 30-10-2017
2 1 `a1,a2,a4`Negative 28-02-2017
3 1 `a1,a2,a4`Negative 28-02-2018
4 1 `a1,a2,a4`Positive 30-11-2017
5 1 `a1,a4,a3`Positive 26-01-2017
6 2 `a1,a2,a3`Negative 27-01-2017
7 2 `a1,a2,a5`Negative 28-01-2017
8 2 `a1,a7,a3`Neutral 28-08-2017