R: Counting clustered values in a CSV

I have a CSV file whose rows each contain a name followed by a series of empty values and clusters of real values:

Robert,,,1:00-5:00,1:00-5:00,1:00-5:00,,,,,,2:00-4:00,2:00-4:00,2:00-4:00
John,,,1:00-5:00,1:00-5:00,,,,,,,,,,,,
Casey,,,1:00-5:00,1:00-5:00,1:00-5:00,,,,,,2:00-4:00,2:00-4:00,,,
Sarah,,,1:00-5:00,,,,,,,,2:00-4:00,2:00-4:00,2:00-4:00,,
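I would like to write a script in R that counts the clusters. If a row contains three actual values in sequence, I want to count that as one cluster. If fewer than three values are clustered together, i.e. one or two in a row, I want to count that as a separate kind of cluster, so each row yields two counts.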

Desired output in CSV format:

Robert,2,0
John,0,1
Casey,1,1
Sarah,1,1
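Edit: The CSV that the code imports does have a header, but I would like the code to ignore the header and start reading at the first data row, i.e. Robert,,,1:00-5:00,.... I would also like to ignore the last column of the imported CSV, which contains the total hours each person worked. Here is a GitHub link to a sample CSV: timeclock_report.csv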

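Here is a possible data.table solution to this old question. It uses fread() to read the input file, melt() and dcast() for reshaping, and the rleid() function to identify the gaps and islands.

For the data set posted in the question, the following code writes the result to output.csv: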

library(data.table)
library(magrittr)

fread("input.csv", header = FALSE, na.strings = c(""), fill = TRUE) %>% 
  .[, V1 := forcats::fct_inorder(V1)] %>%  # to keep the original order in dcast() below
  melt(id.var = "V1") %>% 
  setorder(V1, variable) %>% 
  .[, cluster.id := rleid(V1, is.na(value))] %>%
  .[!is.na(value), .N, by = .(V1, cluster.id)] %>% 
  dcast(V1 ~ N < 3, length, value.var = "N") %>% 
  fwrite("output.csv", col.names = FALSE)
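
The rleid() step is what separates the islands from the gaps. As a minimal sketch, here is that step applied to a single row (Casey's row from the question, typed in by hand with NA for the empty cells). The variable names are illustrative and not part of the original answer:

```r
library(data.table)

# Casey's row from the question, without the name column; NA marks an empty cell.
value <- c(NA, NA, "1:00-5:00", "1:00-5:00", "1:00-5:00",
           NA, NA, NA, NA, NA, "2:00-4:00", "2:00-4:00", NA, NA, NA)

# rleid() starts a new id every time is.na(value) flips,
# so each run of gaps and each run of values gets its own id.
run_id <- rleid(is.na(value))

# Keep only the islands (runs of non-missing cells) and measure their lengths.
island_lengths <- as.vector(table(run_id[!is.na(value)]))

n_long  <- sum(island_lengths >= 3)  # clusters of three or more entries
n_short <- sum(island_lengths < 3)   # clusters of one or two entries
```

In the full pipeline the same idea is applied per person by passing the name column to rleid() as well, so a run can never continue across two rows.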
The OP has also provided a link to another sample data set hosted on GitHub.

With some modifications, the code becomes:

fread("https://raw.githubusercontent.com/agrobins/r_IslandCount/test_files/timeclock_report.csv"
      , drop = "total hours", na.strings = c("")) %>% 
  .[, Employee := forcats::fct_inorder(Employee)] %>%  # to keep the original order in dcast() below
  melt(id.var = "Employee") %>% 
  setorder(Employee, variable) %>% 
  .[, cluster.id := rleid(Employee, is.na(value))] %>% 
  .[!is.na(value), .N, .(Employee, cluster.id)] %>% 
  dcast(Employee ~ N < 3, length, value.var = "N")
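In the result, the first numeric column, named FALSE, contains the number of clusters made up of three or more consecutive entries, while the second numeric column, named TRUE, contains the number of clusters made up of one or two consecutive entries. The column names come from the N < 3 term in the dcast() formula.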
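
The split into long and short clusters can be cross-checked without data.table. The following is a rough base R sketch using rle(); count_clusters() is a hypothetical helper that is not part of the original answer, and it assumes that a run of three or more entries counts as one long cluster:

```r
# Hypothetical helper: count long (>= 3) and short (1 or 2) runs of filled cells.
count_clusters <- function(x) {
  filled  <- !is.na(x) & nzchar(x)         # TRUE where the cell holds a value
  runs    <- rle(filled)                   # run lengths of filled and empty stretches
  islands <- runs$lengths[runs$values]     # keep the lengths of the filled runs only
  c(long = sum(islands >= 3), short = sum(islands < 3))
}

# Rows from the question, without the name column.
robert <- c("", "", "1:00-5:00", "1:00-5:00", "1:00-5:00",
            "", "", "", "", "", "2:00-4:00", "2:00-4:00", "2:00-4:00")
john   <- c("", "", "1:00-5:00", "1:00-5:00", rep("", 12))

count_clusters(robert)
count_clusters(john)
```

The long count corresponds to the FALSE column and the short count to the TRUE column of the dcast() result.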

For the data set posted in the question, output.csv contains:

Robert,2,0
John,0,1
Casey,1,1
Sarah,1,1

For the second data set, the modified code returns:

          Employee FALSE TRUE
1:      John Smith     1    1
2:     Emily Smith     0    1
3:  Robert Jenkins     0    2
4: Rachel Lipscomb     0    1
5:   Donald Driver     1    0

Reproducible data

Since links to external websites are fragile, here is a copy of the second data set as retrieved from GitHub:

Employee,"Mar 23, 2015","Mar 24, 2015","Mar 25, 2015","Mar 26, 2015","Mar 27, 2015","Mar 28, 2015","Mar 29, 2015",total hours
"John Smith",16:35 - 21:17 / 4.7,16:35 - 21:17 / 4.7,16:35 - 21:17 / 4.7,,,,11:17 - 16:08 / 4.85,18.9569
"Emily Smith",,,,,,08:13 - 12:40 / 4.45,,4.4472222222222
"Robert Jenkins",16:54 - 21:11 / 4.29,16:54 - 21:11 / 4.29,,,16:22 - 22:59 / 6.61,,,15.18638
"Rachel Lipscomb",,,,,,13:18 - 19:04 / 5.76,,5.7638888888889
"Donald Driver",,,,,08:13 - 13:05 / 4.86,08:13 - 13:05 / 4.86,10:02 - 16:02 / 6,15.14694

Comments:

Do you have some data to share? Simply put, you can do this with the reshape2 library df

Hi, thanks for the reply. The CSV that the code imports does have a header, but I would like the code to ignore the header and start reading at the first data row, i.e. Robert,,,1:00-5:00,.... I would also like to ignore the last column of the imported CSV, which contains the total hours each person worked. Please excuse the poor file format, it is not the easiest to work with. Here is a GitHub link to a sample CSV: timeclock_report.csv