R: Find the unique values of a specific column within a group
When I try to use length(unique(ID)), I get the number of distinct IDs across all rows rather than within each group. The sqldf query below returns the result I expect; the dplyr version that follows does not.
# Expected result, computed with sqldf
library(sqldf)
data <- sqldf("select count(distinct ID) as distinctID, count(type) as rowCount, type, ag_id, Outcome, bdate, sd_num from buy_pattern group by ag_id, Outcome, sd_num, bdate")
# > data
# distinctID rowCount type ag_id Outcome bdate sd_num
# 1 2 7 A1 A0001 Aggressive 2012 AIG0001
# 2 1 1 B1 B0001 Balanced 2012 AIG0001
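The main reason is that "ID" also exists as a vector in the global environment. In the dplyr chain, select() does not keep "ID", so by the time mutate() runs there is no "ID" column left and the name is resolved from the global environment instead. That global vector has 3 unique elements in total and is not split by group_by(), so every group reports 3. Keeping "ID" in select() fixes the problem. n_distinct(ID) can also replace length(unique(ID)).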
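The name lookup described above can be reproduced on a small data set. This is only a sketch: the `buy_pattern` data frame below is a hypothetical stand-in with fewer columns than the one in the question, built so that it has 3 distinct IDs overall, 2 in the first group and 1 in the second.

```r
library(dplyr)

# Hypothetical stand-in for buy_pattern (assumed data, not the asker's)
buy_pattern <- data.frame(
  ID      = c(1, 1, 1, 2, 2, 2, 2, 3),
  type    = c(rep("A1", 7), "B1"),
  Outcome = c(rep("Aggressive", 7), "Balanced")
)

# A vector in the calling environment that shares the column's name
ID <- buy_pattern$ID

# select() drops the ID column, so ID below refers to the whole vector
wrong <- buy_pattern %>%
  select(type, Outcome) %>%
  group_by(type, Outcome) %>%
  summarise(distinctID = length(unique(ID)), .groups = "drop")

# With the column still present, ID is evaluated separately per group
right <- buy_pattern %>%
  group_by(type, Outcome) %>%
  summarise(distinctID = n_distinct(ID), .groups = "drop")
```

Here `wrong$distinctID` is the overall count in every group, while `right$distinctID` is the per-group count. Without the stray `ID` vector, the first pipeline would stop with an "object not found" error instead of silently returning the wrong number.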
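Comments:
We can use n_distinct. The reason for the wrong result is that you don't have ID in select().
sapply(split(buy_pattern$ID, buy_pattern$Outcome), unique), or the equivalent tapply call on buy_pattern$ID and buy_pattern$Outcome. Uniqueness depends on how you define the group.
In this case, is there any difference between mutate and summarise?
@Akki The difference is that mutate keeps all the columns, and when you then slice it gives you the first row of each group, whereas summarise will not return "ID": only the columns in group_by and the new summary columns appear in the output.
Will summarise give a performance benefit when running this example on 3.5 million rows?
@Akki With mutate you are creating a column, whereas summarise only summarises it. So there should be some performance improvement.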
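The base R route mentioned in the comments can be sketched as follows. The data frame is a hypothetical stand-in, and the anonymous function that counts distinct values is an addition for illustration (the comment itself applies plain `unique`).

```r
# Hypothetical stand-in for buy_pattern (assumed data, not the asker's)
buy_pattern <- data.frame(
  ID      = c(1, 1, 1, 2, 2, 2, 2, 3),
  Outcome = c(rep("Aggressive", 7), "Balanced")
)

# split() partitions ID by Outcome; each piece is reduced to a distinct count
by_split <- sapply(split(buy_pattern$ID, buy_pattern$Outcome),
                   function(x) length(unique(x)))

# tapply() performs the split and the apply in a single call
by_tapply <- tapply(buy_pattern$ID, buy_pattern$Outcome,
                    function(x) length(unique(x)))
```

Both results are named by the values of `Outcome`. Grouping by more than one column would need a list of factors as the grouping argument, which is where the dplyr version tends to be easier to read.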
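# Original dplyr attempt: select() drops ID, so unique(ID) no longer refers to the grouped column
library(dplyr)
data <- buy_pattern %>%
  select(type, ag_id, Outcome, bdate, sd_num) %>%
  group_by(type, ag_id, Outcome, sd_num, bdate) %>%
  mutate(rowCount = n(), distinctID = length(unique(ID))) %>%
  arrange(ag_id, Outcome, sd_num, desc(rowCount)) %>%
  slice(1)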
# > data
# distinctID rowCount type ag_id Outcome bdate sd_num
# 1 3 7 A1 A0001 Aggressive 2012 AIG0001
# 2 3 1 B1 B0001 Balanced 2012 AIG0001
# Fix: keep ID in select() so that it is evaluated within each group
buy_pattern %>%
  select(ID, type, ag_id, Outcome, bdate, sd_num) %>% # change here
  group_by(type, ag_id, Outcome, sd_num, bdate) %>%
  mutate(rowCount = n(), distinctID = length(unique(ID))) %>%
  arrange(ag_id, Outcome, sd_num, desc(rowCount)) %>%
  slice(1)
# A tibble: 2 x 8
# Groups: type, ag_id, Outcome, sd_num, bdate [2]
# ID type ag_id Outcome bdate sd_num rowCount distinctID
# <dbl> <fctr> <fctr> <fctr> <fctr> <fctr> <int> <int>
#1 1 A1 A0001 Aggressive 2012 AIG0001 7 2
#2 3 B1 B0001 Balanced 2012 AIG0001 1 1
# Alternative: summarise() with n_distinct(), one row per group
buy_pattern %>%
  group_by(type, ag_id, Outcome, sd_num, bdate) %>%
  summarise(rowCount = n(), distinctID = n_distinct(ID))
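To make the mutate versus summarise difference from the comments concrete, here is a small side-by-side sketch. The data frame is a hypothetical stand-in with fewer columns than the original, and `.groups = "drop"` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical stand-in for buy_pattern (assumed data, not the asker's)
buy_pattern <- data.frame(
  ID      = c(1, 1, 1, 2, 2, 2, 2, 3),
  type    = c(rep("A1", 7), "B1"),
  Outcome = c(rep("Aggressive", 7), "Balanced")
)

# mutate() adds the counts to every row; slice(1) then keeps one row per group
with_mutate <- buy_pattern %>%
  group_by(type, Outcome) %>%
  mutate(rowCount = n(), distinctID = n_distinct(ID)) %>%
  slice(1) %>%
  ungroup()

# summarise() collapses each group directly to a single row
with_summarise <- buy_pattern %>%
  group_by(type, Outcome) %>%
  summarise(rowCount = n(), distinctID = n_distinct(ID), .groups = "drop")
```

Both results have one row per group with the same counts. The mutate version still carries an `ID` column (the value from the first row of each group), while the summarise version contains only the grouping columns and the new summary columns.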