R: How do I perform a complex transformation of the strings in one column based on the values in other columns?


I have a data frame:

ID          time             value       operation
K1   2020-10-12 07:35:47    K1-0735       create
K1   2020-10-12 07:35:49    K1-0735       upload
K1   2020-10-12 07:35:50    
K1   2020-10-12 07:35:55    K1-0735       create   
K1   2020-10-12 07:35:58    K1-0735       upload
K1   2020-10-12 07:37:19    
KK   2020-10-13 08:11:09    KK-0811       create
KK   2020-10-13 08:11:09    KK-0811       create
KK   2020-10-13 08:11:12    KK-0811       upload       
KK   2020-10-13 08:11:15
KK   2020-10-13 08:11:25    KK-0811       create
KK   2020-10-13 08:11:26    KK-0811       upload  
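As you can see, the value column is the concatenation of the ID column's value and the hour and minute from the timestamp, without spaces. I want to append the seconds to that value so that it is unique within a minute. However, the seconds must be the same for the create and upload rows that belong to one operation "bundle" in the operation column.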
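To make the construction concrete, here is a minimal base R sketch for a single row. The names `id` and `ts` are made up for illustration, and UTC is assumed as the time zone.

```r
# Hypothetical single-row illustration; `id` and `ts` are not from the question.
id <- "K1"
ts <- as.POSIXct("2020-10-12 07:35:47", tz = "UTC")

# Current form of `value`: ID, a dash, then hour and minute.
current_value <- paste0(id, "-", format(ts, "%H%M"))

# Wanted form: the same, with the seconds appended.
wanted_value <- paste0(id, "-", format(ts, "%H%M%S"))

print(current_value)  # "K1-0735"
print(wanted_value)   # "K1-073547"
```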

So the desired result is:

ID          time             value         operation
K1   2020-10-12 07:35:47    K1-073547       create
K1   2020-10-12 07:35:49    K1-073547       upload
K1   2020-10-12 07:35:50    
K1   2020-10-12 07:35:55    K1-073555       create   
K1   2020-10-12 07:35:58    K1-073555       upload
K1   2020-10-12 07:37:19    
KK   2020-10-13 08:11:09    KK-081109       create
KK   2020-10-13 08:11:09    KK-081109       create
KK   2020-10-13 08:11:12    KK-081109       upload       
KK   2020-10-13 08:11:15
KK   2020-10-13 08:11:25    KK-081125       create
KK   2020-10-13 08:11:26    KK-081125       upload  
How can I do this kind of transformation?

Code to build the data frame:

ID <- c("K1","K1","K1","K1","K1","K1","KK","KK","KK","KK","KK","KK")

time <- as.Date(c('2020-10-12 07:35:47','2020-10-12 07:35:49','2020-10-12 07:35:50',
                     '2020-10-12 07:35:55','2020-10-12 07:35:58', '2020-10-12 07:37:19',
                     '2020-10-13 08:11:09','2020-10-13 08:11:09','2020-10-13 08:11:12',
                     '2020-10-13 08:11:15','2020-10-13 08:11:25','2020-10-13 08:11:26'))

value <- c("K1-0735","K1-0735",NA,"K1-0735","K1-0735",NA,"KK-0811","KK-0811","KK-0811",
           NA,"KK-0811","KK-0811")

operation <- c("create", "upload", NA,"create", "upload", NA,"create","create", "upload",
               NA,"create", "upload")
data <- data.frame(ID, time, value,operation)

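Here is a somewhat cumbersome approach, but I think it can get you somewhere.

The strategy is to find the start/stop points in the operation vector.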
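The start-point idea can be sketched in base R roughly as follows. This is not the answer's original code, only one way to read the strategy: it assumes the rows are sorted by time, that a bundle begins at a "create" row that does not directly follow another "create", and that the first row starts a bundle. The names `prev`, `is_start`, `bundle`, `start_time` and `new_value` are made up for the illustration.

```r
# Sample vectors copied from the question; times parsed as date-times (UTC assumed).
ID <- c("K1","K1","K1","K1","K1","K1","KK","KK","KK","KK","KK","KK")
time <- as.POSIXct(c("2020-10-12 07:35:47", "2020-10-12 07:35:49", "2020-10-12 07:35:50",
                     "2020-10-12 07:35:55", "2020-10-12 07:35:58", "2020-10-12 07:37:19",
                     "2020-10-13 08:11:09", "2020-10-13 08:11:09", "2020-10-13 08:11:12",
                     "2020-10-13 08:11:15", "2020-10-13 08:11:25", "2020-10-13 08:11:26"),
                   tz = "UTC")
operation <- c("create", "upload", NA, "create", "upload", NA,
               "create", "create", "upload", NA, "create", "upload")

# A bundle starts at a "create" row that does not directly follow another "create".
prev <- c(NA, head(operation, -1))
is_start <- !is.na(operation) & operation == "create" &
  (is.na(prev) | prev != "create")

# Number the bundles; rows without an operation belong to no bundle.
bundle <- ifelse(is.na(operation), NA, cumsum(is_start))

# Every row of a bundle takes the timestamp of that bundle's start row.
start_time <- time[is_start][bundle]
new_value <- ifelse(is.na(bundle), NA,
                    paste0(ID, "-", format(start_time, "%H%M%S")))

print(data.frame(ID, operation, bundle, new_value))
```

On the sample data this yields four bundles and the values K1-073547, K1-073555, KK-081109 and KK-081125, with NA on the rows that have no operation.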

Here is your data. Note that I used lubridate::as_datetime, because the as.Date call you used drops the time of day. I also set stringsAsFactors = FALSE.
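The difference is easy to see on a single timestamp. The sketch below uses base R's as.POSIXct in place of lubridate::as_datetime so that it runs without extra packages; UTC is an assumed time zone, and `x`, `d` and `dt` are illustrative names.

```r
x <- "2020-10-12 07:35:47"

d  <- as.Date(x)                 # Date class: the time of day is discarded
dt <- as.POSIXct(x, tz = "UTC")  # date-time class: the time of day is kept

print(format(d))                        # "2020-10-12"
print(format(dt, "%Y-%m-%d %H:%M:%S"))  # "2020-10-12 07:35:47"
```

With the Date version there are no seconds left to append, which is why the time column has to be a date-time before any of the approaches here can work.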

I left the extra columns in so you can check that everything matches the expected result; you can select the ones you need afterwards.

# A tibble: 12 x 7
# Groups:   bundle [5]
   ID    time                value   operation bundle  seconds_to_paste new_id   
   <chr> <dttm>              <chr>   <chr>     <chr>              <dbl> <chr>    
 1 K1    2020-10-12 07:35:47 K1-0735 create    bundle1               47 K1-073547
 2 K1    2020-10-12 07:35:49 K1-0735 upload    bundle1               47 K1-073547
 3 K1    2020-10-12 07:35:50 NA      NA        NA                    50 NA       
 4 K1    2020-10-12 07:35:55 K1-0735 create    bundle2               55 K1-073555
 5 K1    2020-10-12 07:35:58 K1-0735 upload    bundle2               55 K1-073555
 6 K1    2020-10-12 07:37:19 NA      NA        NA                    50 NA       
 7 KK    2020-10-13 08:11:09 KK-0811 create    bundle3                9 KK-081109
 8 KK    2020-10-13 08:11:09 KK-0811 create    bundle3                9 KK-081109
 9 KK    2020-10-13 08:11:12 KK-0811 upload    bundle3                9 KK-081109
10 KK    2020-10-13 08:11:15 NA      NA        NA                    50 NA       
11 KK    2020-10-13 08:11:25 KK-0811 create    bundle4               25 KK-081125
12 KK    2020-10-13 08:11:26 KK-0811 upload    bundle4               25 KK-081125

We can get a lot done with functions from data.table. I renamed your data to dat instead of data.


Here is another approach using pivot_ and a self join:

# The beginning of this pipeline did not survive extraction; the rest is
# restored as far as it is legible.
... %>%
  distinct() %>%
  left_join(data, by = c("ID", "value", "time" = "create")) %>%
  mutate(value = case_when(!is.na(value2) & !is.na(value) ~ value2,
                           !is.na(lag(value2)) & !is.na(value) ~ lag(value2))) %>%
  select(-value2)
#>    ID                time     value operation
#> 1  K1 2020-10-12 07:35:47 K1-073547    create
#> 2  K1 2020-10-12 07:35:49 K1-073547    upload
#> 3  K1 2020-10-12 07:35:50
#> 4  K1 2020-10-12 07:35:55 K1-073555    create
#> 5  K1 2020-10-12 07:35:58 K1-073555    upload
#> 6  K1 2020-10-12 07:37:19
#> 7  KK 2020-10-13 08:11:09 KK-081109    create
#> 8  KK 2020-10-13 08:11:09 KK-081109    create
#> 9  KK 2020-10-13 08:11:12 KK-081109    upload
#> 10 KK 2020-10-13 08:11:15
#> 11 KK 2020-10-13 08:11:25 KK-081125    create
#> 12 KK 2020-10-13 08:11:26 KK-081125    upload
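Created on 2021-01-14 by the reprex package (v0.3.0)

Quick question: is the value code always two letters, then a dash, then the time?

@Tom The value consists of the ID column's value (such as KK), a dash, and the hour and minute, concatenated without spaces. So if a row has ID K1 and timestamp 2020-10-12 07:35:47, its value is K1-0735.

Is time an actual time variable or a string? Could you provide sample data via dput? Thanks.

@starja I added the code that creates the data frame. Yes, time has a date format.

What happens if create and upload fall in two different minutes? E.g., create at 12:59:10 and upload at 13:00:01?

Your output does not match the expected result: it should not have the index, min_time and value2 columns. Wouldn't it be more efficient to transform the existing column instead of creating new ones?

I originally left those columns in to show the calculation process. I have now removed them to get closer to your desired result. I think creating a new column gives a simpler solution whose programming breadcrumbs are easier to follow, and you don't have to deal with any inefficiencies the existing column might carry.

Creating the index from value (index = rleid(value)) is wrong, because two create-upload operation bundles can occur within the same minute with different seconds, and they would get the same index.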
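The objection to index = rleid(value) can be illustrated with a small made-up case. The sketch uses base R's rle as a stand-in for data.table::rleid (for a character vector without NA values the run numbering should be the same); the four rows below are hypothetical, not from the question.

```r
# Two create/upload bundles in the same minute, with no empty row between them.
value     <- c("K1-0735", "K1-0735", "K1-0735", "K1-0735")
operation <- c("create",  "upload",  "create",  "upload")

# Number consecutive runs of identical `value`.
runs  <- rle(value)
index <- rep(seq_along(runs$lengths), runs$lengths)

print(index)  # 1 1 1 1
```

Both bundles land in run 1, so taking the minimum time per run would give all four rows the seconds of the first create.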
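# data.table approach: rleid() numbers consecutive runs of identical `value`.
# Assumes `dat` holds the question's data with `time` stored as a date-time.
library(tidyverse)
library(data.table)
library(lubridate)

dat %>%
  mutate(index = rleid(value)) %>%
  group_by(index) %>%
  mutate(min_time = min(time)) %>%   # earliest timestamp of each run
  mutate(value2 = paste0(ID, "-", 
                         str_pad(hour(min_time), 2, pad = "0"),
                         str_pad(minute(min_time), 2, pad = "0"),
                         str_pad(second(min_time), 2, pad = "0"))) %>%
  mutate(value2 = ifelse(is.na(value), NA, value2)) %>%   # rows without a value stay NA
  ungroup() %>%
  select(ID, time, value = value2, operation)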

   ID    time                value     operation
   <chr> <dttm>              <chr>     <chr>    
 1 K1    2020-10-12 07:35:47 K1-073547 create   
 2 K1    2020-10-12 07:35:49 K1-073547 upload   
 3 K1    2020-10-12 07:35:50 NA        NA       
 4 K1    2020-10-12 07:35:55 K1-073555 create   
 5 K1    2020-10-12 07:35:58 K1-073555 upload   
 6 K1    2020-10-12 07:37:19 NA        NA       
 7 KK    2020-10-13 08:11:09 KK-081109 create   
 8 KK    2020-10-13 08:11:09 KK-081109 create   
 9 KK    2020-10-13 08:11:12 KK-081109 upload   
10 KK    2020-10-13 08:11:15 NA        NA       
11 KK    2020-10-13 08:11:25 KK-081125 create   
12 KK    2020-10-13 08:11:26 KK-081125 upload   
# Alternative grouping: start a new group at every row where `value` is NA.
dat %>%
    mutate(row_sep = cumsum(is.na(value))) %>%
    group_by(row_sep) %>%
    # Earliest timestamp among the non-NA rows of each group. Subsetting keeps
    # the POSIXct class and time zone, which ifelse() would strip.
    mutate(min_time = min(time[!is.na(value)])) %>%
    mutate(value2 = paste0(ID, "-", 
                           str_pad(hour(min_time), 2, pad = "0"),
                           str_pad(minute(min_time), 2, pad = "0"),
                           str_pad(second(min_time), 2, pad = "0"))) %>%
    mutate(value2 = ifelse(is.na(value), NA, value2)) %>%   # rows without a value stay NA
    ungroup() %>%
    select(ID, time, value = value2, operation)

   ID    time                value     operation
   <chr> <dttm>              <chr>     <chr>    
 1 K1    2020-10-12 07:35:47 K1-073547 create   
 2 K1    2020-10-12 07:35:49 K1-073547 upload   
 3 K1    2020-10-12 07:35:50 NA        NA       
 4 K1    2020-10-12 07:35:55 K1-073555 create   
 5 K1    2020-10-12 07:35:58 K1-073555 upload   
 6 K1    2020-10-12 07:37:19 NA        NA       
 7 KK    2020-10-13 08:11:09 KK-081109 create   
 8 KK    2020-10-13 08:11:09 KK-081109 create   
 9 KK    2020-10-13 08:11:12 KK-081109 upload   
10 KK    2020-10-13 08:11:15 NA        NA       
11 KK    2020-10-13 08:11:25 KK-081125 create   
12 KK    2020-10-13 08:11:26 KK-081125 upload
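To see what the row_sep = cumsum(is.na(value)) grouping does, here is a short base R sketch on the question's value vector. It only shows the group numbers and is not part of the answer's code.

```r
value <- c("K1-0735", "K1-0735", NA, "K1-0735", "K1-0735", NA,
           "KK-0811", "KK-0811", "KK-0811", NA, "KK-0811", "KK-0811")

# The counter increases by one at every NA row, so each NA row opens a new group
# that also contains the non-NA rows following it.
row_sep <- cumsum(is.na(value))

print(row_sep)  # 0 0 1 1 1 2 2 2 2 3 3 3
```

Each group holds at most one bundle here because an NA row separates the bundles in the sample data. Two bundles with no NA row between them would still share a group.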