R: Computation failed in stat_alluvium(): Each row of output must be identified by a unique combination of keys

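I have a data.frame that was built with a series of tidyverse tools, mostly dplyr verbs chained with pipes. The data appears to be in the right format for the published geom_flow() examples. My real data set goes through a number of transformations after being imported from an MSSQL database and has about 300k rows. Because of that, I created a dummy version that reports the same classes and formats as the data I start with when setting it up for ggalluvial. I also did this to see whether the error can be reproduced on a smaller scale, which would make troubleshooting easier.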

data <- data.frame(Employee = as.numeric(c(1450,1450,1450,1450,1460,1460,1460,1460,1470,1470)),
                  PostDate = as.POSIXct(c("2019-08-15","2019-09-12","2019-09-15","2019-10-12","2019-08-15","2019-09-12","2019-09-15","2019-10-12","2019-08-15","2019-09-12")),
                  Job = as.character(c("1901", "1901","1902","1902","1901", "1901","1902","1902","1901", "1901")),
                  Phase = as.character(c("950-", "950-", "950-", "950-", "950-", "950-", "950-", "950-", "950-", "950-")),
                  Craft = as.character(c("Lab", "Lab", "Lab", "Lab", "Lab", "Lab", "Lab", "Lab", "Lab", "Lab")),
                  Class = as.character(c("1B", "1B", "1B", "1B", "1B", "1B", "1B", "1B", "1B", "1B")),
                  EarnCode  = as.numeric(c("51", "51", "51", "51", "51", "51", "51", "51", "51", "51")),
                  Hours = as.numeric(c(8, 8, 7, 6, 5, 4, 12, 3, 8, 9)),
                  Rate = as.numeric(c(50, 50, 50, 50, 50, 50, 50, 50, 50, 50)),
                  Amt = as.numeric(c(100, 100, 100, 100, 100, 100, 100, 100, 100, 100)),
                  LastName = as.character(c("bill", "bill", "bill", "bill", "mike", "mike", "mike", "mike", "joe", "joe")),
                  FirstName = as.character(c("bill", "bill", "bill", "bill", "mike", "mike", "mike", "mike", "joe", "joe")), stringsAsFactors=FALSE)
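Edit: While typing this post, I noticed that Job creates a condition in which one employee would be counted in two places in each bar of the chart. I thought that might be part of the problem, so I reworked the test data and tested again with the following: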

        data <- data.frame(Employee = as.numeric(c(1450,1450,1450,1450,1460,1460,1460,1460,1470,1470)),
                       PostDate = as.POSIXct(c("2019-08-15","2019-08-12","2019-09-15","2019-10-12","2019-08-15","2019-08-12","2019-09-15","2019-10-12","2019-08-15","2019-09-12")),
                       Job = as.character(c("1901", "1901","1902","1902","1901", "1901","1902","1902","1901", "1901")),
                       Phase = as.character(c("950-", "950-", "950-", "950-", "950-", "950-", "950-", "950-", "950-", "950-")),
                       Craft = as.character(c("Lab", "Lab", "Lab", "Lab", "Lab", "Lab", "Lab", "Lab", "Lab", "Lab")),
                       Class = as.character(c("1B", "1B", "1B", "1B", "1B", "1B", "1B", "1B", "1B", "1B")),
                       EarnCode  = as.numeric(c("51", "51", "51", "51", "51", "51", "51", "51", "51", "51")),
                       Hours = as.numeric(c(8, 8, 7, 6, 5, 4, 12, 3, 8, 9)),
                       Rate = as.numeric(c(50, 50, 50, 50, 50, 50, 50, 50, 50, 50)),
                       Amt = as.numeric(c(100, 100, 100, 100, 100, 100, 100, 100, 100, 100)),
                       LastName = as.character(c("bill", "bill", "bill", "bill", "mike", "mike", "mike", "mike", "joe", "joe")),
                       FirstName = as.character(c("bill", "bill", "bill", "bill", "mike", "mike", "mike", "mike", "joe", "joe")), stringsAsFactors=FALSE)
Now I get the same reduction to 287 rows, but the error is now:

Each row of output must be identified by a unique combination of keys.
Keys are shared for 1 rows:
* 109, 110
Looking at these two rows in RStudio's View(), I don't understand why it still flags them as sharing keys:


106  1906-  2019-10-01  4267  1   91.5
107  1906-  2019-10-01  4317  1  119.0
108  1907-  2019-08-01   582  1  406.0
109  1907-  2019-08-01   705  1  396.0
110  1907-  2019-08-01  1224  1  229.5
111  1907-  2019-08-01  1700  1  179.5
112  1907-  2019-08-01  1744  1  235.0
113  1907-  2019-08-01  1959  1  234.5
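Any further suggestions for avoiding this error would be very helpful. Searching for it has been frustrating, because the vast majority of results are about spread() and have no obvious relevance. It may be one of my dplyr or ggalluvial commands, which go a few layers deep, but it is at a level I cannot work out myself.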

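Is there a good way to keep this error from occurring? Why are my rows 109 and 110 still flagged as duplicates when they clearly are not duplicates of each other? This will eventually go into a Shiny app, so my solution needs to be robust to user-supplied date ranges.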
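
The wording of the message matches the error tidyr::spread() raises when two rows share the same key, which suggests the stat is reshaping the data and finding more than one row for the same alluvium at the same axis position. The row numbers it prints may also refer to the data as the stat sees it rather than to the rows shown in View(). One way to look for such collisions directly is to count rows per key. The sketch below assumes Employee is the alluvium and month is the x axis (the plotting call is not shown in the question), and `summary_df`, `dup_keys` and `n_rows` are hypothetical names.

```r
library(dplyr)

# Hypothetical summary table: one row per Job / month / Employee,
# shaped like the rows shown in View() above.
summary_df <- data.frame(
  Job      = c("1901", "1902", "1901", "1902", "1901"),
  month    = c("2019-08-01", "2019-08-01", "2019-08-01", "2019-09-01", "2019-09-01"),
  Employee = c(1450, 1450, 1460, 1460, 1470),
  Hours    = c(16, 7, 9, 12, 9),
  stringsAsFactors = FALSE
)

# Count rows per (Employee, month). Any count above 1 means the same
# employee appears more than once in the same month (here under two Jobs),
# which is the assumed source of the "shared keys" complaint.
dup_keys <- summary_df %>%
  count(Employee, month, name = "n_rows") %>%
  filter(n_rows > 1)

print(dup_keys)
```

If `dup_keys` comes back empty for a given date range, the key combination assumed here is unique and the cause lies elsewhere.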

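I'm not happy with this solution, but at least it works. The two variables I wanted to group on did not work with group_by(Employee, month) %>%, so instead I added a mutate that builds a combined Employee-month variable, grouped on that, and then used distinct to make sure the combined variable has no duplicates.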

    library(dplyr)
    library(lubridate)  # provides floor_date()

    df_data_Aluv <- data %>%
        filter(PostDate >= "2019-08-01" & PostDate <= "2019-10-30") %>%
        select(date = PostDate, Employee, Job, Hours) %>%
        # one row per Job / month / Employee
        group_by(Job, month = as.character(floor_date(date, "month")), Employee) %>%
        summarize(freq = n_distinct(Employee), Hours = sum(Hours)) %>%
        # combined Employee-month key
        mutate(empmon = paste(Employee, " -- ", month)) %>%
        group_by(empmon) %>%
        # keep the Job with the most hours for each key, then drop any ties
        filter(Hours == max(Hours)) %>%
        distinct(empmon, .keep_all = TRUE)
It works for any date range, so at least I get what I wanted.
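
As a rough check of this approach, the sketch below runs the same steps on the first test data set from the question, trimmed to the columns the pipeline uses. A few things are assumptions rather than part of the original: `format()` stands in for `floor_date()` to avoid the lubridate dependency, the date filter is left out, `tz = "UTC"` is set so the result does not depend on the local time zone, `.groups = "drop"` requires dplyr 1.0.0 or later, and `keys_are_unique()` is a hypothetical helper.

```r
library(dplyr)

# First test data set from the question, trimmed to four columns.
data <- data.frame(
  Employee = c(1450, 1450, 1450, 1450, 1460, 1460, 1460, 1460, 1470, 1470),
  PostDate = as.POSIXct(c("2019-08-15", "2019-09-12", "2019-09-15", "2019-10-12",
                          "2019-08-15", "2019-09-12", "2019-09-15", "2019-10-12",
                          "2019-08-15", "2019-09-12"), tz = "UTC"),
  Job      = c("1901", "1901", "1902", "1902", "1901",
               "1901", "1902", "1902", "1901", "1901"),
  Hours    = c(8, 8, 7, 6, 5, 4, 12, 3, 8, 9),
  stringsAsFactors = FALSE
)

# One row per Job / month / Employee.
# format() is a stand-in (assumption) for lubridate::floor_date().
summarised <- data %>%
  group_by(Job, month = format(PostDate, "%Y-%m-01"), Employee) %>%
  summarize(Hours = sum(Hours), .groups = "drop")

# Same de-duplication idea as the answer: build a combined key, keep the
# row with the most hours per key, then drop any remaining ties.
deduped <- summarised %>%
  mutate(empmon = paste(Employee, "--", month)) %>%
  group_by(empmon) %>%
  filter(Hours == max(Hours)) %>%
  distinct(empmon, .keep_all = TRUE) %>%
  ungroup()

# Hypothetical guard: TRUE when no (Employee, month) pair occurs twice.
keys_are_unique <- function(df) {
  anyDuplicated(as.data.frame(df[c("Employee", "month")])) == 0
}

print(keys_are_unique(summarised))
print(keys_are_unique(deduped))
```

In a Shiny app, a guard like this could be evaluated on the filtered data before the plot is drawn, so that an unexpected date range produces a readable message instead of a failed stat.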

For reference, the other error message from the question:

Each row of output must be identified by a unique combination of keys.
Keys are shared for 287 rows:
* 1, 2
* 3, 4
* 5, 6
... (lists every row this way)