R: Finding the unique values of a specific column within groups

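When I use length(unique(ID)), I get the count over all rows rather than the count within each group. The sqldf query below produces the result I expect: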

data<-sqldf("select count(distinct ID) as distinctID,count(type) as rowCount,type,ag_id,Outcome,bdate,sd_num from buy_pattern group by ag_id,Outcome,sd_num,bdate")


 # > data 
 # distinctID rowCount type ag_id    Outcome bdate  sd_num
 # 1          2        7   A1 A0001 Aggressive  2012 AIG0001
 # 2          1        1   B1 B0001   Balanced  2012 AIG0001

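The main cause is that "ID" also exists as a vector object in the global environment. In the dplyr chain from the question, select does not include "ID", so the column is dropped and "ID" is looked up in the global environment instead. That whole vector has 3 unique elements, and it does not follow the grouping set up by group_by, so every group reports 3. Keeping "ID" in select fixes the problem. n_distinct(ID) can also be used in place of length(unique(ID)).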
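The lookup behaviour described above can be reproduced on a small scale. The following is a minimal sketch, assuming dplyr is installed; the data frame df and its columns grp and ID are hypothetical toy data, not the buy_pattern data from the question.

```r
library(dplyr)

# Hypothetical toy data (assumption, not the question's buy_pattern).
df <- data.frame(
  grp = c("a", "a", "a", "b"),
  ID  = c(1, 1, 2, 3)
)

# A global vector that happens to share its name with the column.
ID <- df$ID

# ID is dropped by select(), so unique(ID) falls back to the global
# vector and every group reports the overall number of distinct IDs.
wrong <- df %>%
  select(grp) %>%
  group_by(grp) %>%
  summarise(distinctID = length(unique(ID)))

# ID is kept, so unique(ID) only sees the rows of the current group.
right <- df %>%
  select(grp, ID) %>%
  group_by(grp) %>%
  summarise(distinctID = length(unique(ID)))

print(wrong)
print(right)
```

In this sketch wrong reports 3 for both groups, while right reports 2 for group "a" and 1 for group "b". If no global ID existed, the first chain would instead stop with an "object not found" error, which makes the mistake much easier to spot.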

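We can use n_distinct. The reason for the wrong result is that ID is not included in the select call.

In base R: sapply(split(buy_pattern$ID, buy_pattern$Outcome), unique), or the equivalent tapply call on buy_pattern$ID and buy_pattern$Outcome. Which values count as unique depends on how you define the groups.

Is there any difference between mutate and summarise in this case?

@Akki The difference is that mutate keeps all columns, so slice then returns the first row of each group, whereas summarise does not return "ID": only the columns named in group_by and the new summary columns appear in the output.

Would summarise give a performance benefit when running this example on 3.5 million rows?

@Akki With mutate you are creating columns, whereas summarise only summarises. So yes, performance should improve.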
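The difference between the two approaches discussed in the comments can be seen by comparing the columns each one returns. This is a minimal sketch, assuming dplyr is installed; df, grp and ID are hypothetical toy names rather than the columns of buy_pattern.

```r
library(dplyr)

# Hypothetical toy data (assumption, not the question's buy_pattern).
df <- data.frame(
  grp = c("a", "a", "a", "b"),
  ID  = c(1, 1, 2, 3)
)

# mutate() adds the counts to every row; slice(1) then keeps the first
# row of each group, so the original ID column is still present.
via_mutate <- df %>%
  group_by(grp) %>%
  mutate(rowCount = n(), distinctID = n_distinct(ID)) %>%
  slice(1) %>%
  ungroup()

# summarise() collapses each group to one row and returns only the
# grouping column plus the new summary columns.
via_summarise <- df %>%
  group_by(grp) %>%
  summarise(rowCount = n(), distinctID = n_distinct(ID))

# Base R version of the per-group distinct count, using split().
base_counts <- sapply(split(df$ID, df$grp), function(x) length(unique(x)))

print(names(via_mutate))
print(names(via_summarise))
print(base_counts)
```

Both dplyr chains give the same counts per group; they differ only in which columns survive and in how much intermediate data is created along the way.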

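    # Attempt from the question: select() drops ID, so unique(ID) below
    # resolves to the global ID vector and ignores the grouping.
    data <- buy_pattern %>% select(type, ag_id, Outcome, bdate, sd_num) %>%
    group_by(type, ag_id, Outcome, sd_num, bdate) %>%
    mutate(rowCount = n(), distinctID = length(unique(ID))) %>%
    arrange(ag_id, Outcome, sd_num, desc(rowCount)) %>%
    slice(1)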

 # > data

 #  distinctID rowCount type ag_id    Outcome bdate  sd_num
 #  1          3        7   A1 A0001 Aggressive  2012 AIG0001
 #  2          3        1   B1 B0001   Balanced  2012 AIG0001

buy_pattern %>% 
      select(ID, type,ag_id,Outcome,bdate,sd_num) %>% # change here: keep ID so it is found in the data
      group_by(type,ag_id,Outcome,sd_num,bdate) %>%
      mutate(rowCount = n(),distinctID=length(unique(ID))) %>% 
      arrange(ag_id,Outcome,sd_num, desc(rowCount))  %>% 
      slice(1) 
# A tibble: 2 x 8
# Groups:   type, ag_id, Outcome, sd_num, bdate [2]
#     ID   type  ag_id    Outcome  bdate  sd_num rowCount distinctID
#   <dbl> <fctr> <fctr>     <fctr> <fctr>  <fctr>    <int>      <int>
#1     1     A1  A0001 Aggressive   2012 AIG0001        7          2
#2     3     B1  B0001   Balanced   2012 AIG0001        1          1

# Alternative: summarise() with n_distinct(), no select() or slice() needed
buy_pattern %>%
     group_by(type, ag_id, Outcome, sd_num, bdate) %>%
     summarise(rowCount = n(), distinctID = n_distinct(ID))