R: How to perform a complex transformation of strings in a column depending on values in other columns?
I have a data frame:
ID time value operation
K1 2020-10-12 07:35:47 K1-0735 create
K1 2020-10-12 07:35:49 K1-0735 upload
K1 2020-10-12 07:35:50
K1 2020-10-12 07:35:55 K1-0735 create
K1 2020-10-12 07:35:58 K1-0735 upload
K1 2020-10-12 07:37:19
KK 2020-10-13 08:11:09 KK-0811 create
KK 2020-10-13 08:11:09 KK-0811 create
KK 2020-10-13 08:11:12 KK-0811 upload
KK 2020-10-13 08:11:15
KK 2020-10-13 08:11:25 KK-0811 create
KK 2020-10-13 08:11:26 KK-0811 upload
Code used to build the data frame:
ID <- c("K1","K1","K1","K1","K1","K1","KK","KK","KK","KK","KK","KK")
# note: as.Date() keeps only the date part; the time of day is dropped
time <- as.Date(c('2020-10-12 07:35:47','2020-10-12 07:35:49','2020-10-12 07:35:50',
                  '2020-10-12 07:35:55','2020-10-12 07:35:58','2020-10-12 07:37:19',
                  '2020-10-13 08:11:09','2020-10-13 08:11:09','2020-10-13 08:11:12',
                  '2020-10-13 08:11:15','2020-10-13 08:11:25','2020-10-13 08:11:26'))
value <- c("K1-0735","K1-0735",NA,"K1-0735","K1-0735",NA,"KK-0811","KK-0811","KK-0811",
           NA,"KK-0811","KK-0811")
operation <- c("create","upload",NA,"create","upload",NA,"create","create","upload",
               NA,"create","upload")
data <- data.frame(ID, time, value, operation)
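As you can see, the column value is a concatenation of the value of column ID and the hour and minute from the timestamp, without spaces. I want to add the seconds to that value to make it unique within one minute. However, the seconds must be the same within each "bundle" of create and upload rows in the operation column.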
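To make the format concrete, here is a minimal base R sketch of how such a value can be composed from an ID and a timestamp, with and without the seconds. The variable names (`id`, `ts`, `current`, `wanted`) are illustrative only, and the UTC time zone is an assumption.

```r
# One ID and one timestamp taken from the first row of the example
id <- "K1"
ts <- as.POSIXct("2020-10-12 07:35:47", tz = "UTC")  # time zone is an assumption

# Current format: ID, dash, hour and minute
current <- paste0(id, "-", format(ts, "%H%M", tz = "UTC"))

# Wanted format: ID, dash, hour, minute and second
wanted <- paste0(id, "-", format(ts, "%H%M%S", tz = "UTC"))

print(current)
print(wanted)
```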
So, the desired result is:
ID time value operation
K1 2020-10-12 07:35:47 K1-073547 create
K1 2020-10-12 07:35:49 K1-073547 upload
K1 2020-10-12 07:35:50
K1 2020-10-12 07:35:55 K1-073555 create
K1 2020-10-12 07:35:58 K1-073555 upload
K1 2020-10-12 07:37:19
KK 2020-10-13 08:11:09 KK-081109 create
KK 2020-10-13 08:11:09 KK-081109 create
KK 2020-10-13 08:11:12 KK-081109 upload
KK 2020-10-13 08:11:15
KK 2020-10-13 08:11:25 KK-081125 create
KK 2020-10-13 08:11:26 KK-081125 upload
How can I do this transformation?
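Here is a somewhat cumbersome approach, but I think it gets you somewhere. The strategy is to look for start/stop points in the operation vector.
This is your data. Note that I used lubridate::as_datetime because as.Date, which you used, drops the time part of the timestamp. I also set stringsAsFactors = FALSE.
I left the extra columns in so you can check that everything matches the expected result; you can select the ones you need afterwards.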
# A tibble: 12 x 7
# Groups: bundle [5]
ID time value operation bundle seconds_to_paste new_id
<chr> <dttm> <chr> <chr> <chr> <dbl> <chr>
1 K1 2020-10-12 07:35:47 K1-0735 create bundle1 47 K1-073547
2 K1 2020-10-12 07:35:49 K1-0735 upload bundle1 47 K1-073547
3 K1 2020-10-12 07:35:50 NA NA NA 50 NA
4 K1 2020-10-12 07:35:55 K1-0735 create bundle2 55 K1-073555
5 K1 2020-10-12 07:35:58 K1-0735 upload bundle2 55 K1-073555
6 K1 2020-10-12 07:37:19 NA NA NA 50 NA
7 KK 2020-10-13 08:11:09 KK-0811 create bundle3 9 KK-081109
8 KK 2020-10-13 08:11:09 KK-0811 create bundle3 9 KK-081109
9 KK 2020-10-13 08:11:12 KK-0811 upload bundle3 9 KK-081109
10 KK 2020-10-13 08:11:15 NA NA NA 50 NA
11 KK 2020-10-13 08:11:25 KK-0811 create bundle4 25 KK-081125
12 KK 2020-10-13 08:11:26 KK-0811 upload bundle4 25 KK-081125
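The start-point strategy described above could be sketched in base R roughly as follows. This is not the answer's original code: the helper names (`bundle_start`, `bundle`, `first_row`, `new_value`) are made up for illustration, the time zone is assumed to be UTC, and a bundle is assumed to begin at a "create" row that does not directly follow another "create" row.

```r
# Rebuild the example with as.POSIXct() so the time of day is kept
# (as.Date() would keep only the date). The UTC time zone is an assumption.
dat <- data.frame(
  ID = rep(c("K1", "KK"), each = 6),
  time = as.POSIXct(c(
    "2020-10-12 07:35:47", "2020-10-12 07:35:49", "2020-10-12 07:35:50",
    "2020-10-12 07:35:55", "2020-10-12 07:35:58", "2020-10-12 07:37:19",
    "2020-10-13 08:11:09", "2020-10-13 08:11:09", "2020-10-13 08:11:12",
    "2020-10-13 08:11:15", "2020-10-13 08:11:25", "2020-10-13 08:11:26"
  ), tz = "UTC"),
  value = c("K1-0735", "K1-0735", NA, "K1-0735", "K1-0735", NA,
            "KK-0811", "KK-0811", "KK-0811", NA, "KK-0811", "KK-0811"),
  operation = c("create", "upload", NA, "create", "upload", NA,
                "create", "create", "upload", NA, "create", "upload"),
  stringsAsFactors = FALSE
)

# A bundle starts at a "create" row whose previous row is not a "create"
is_create    <- !is.na(dat$operation) & dat$operation == "create"
prev_create  <- c(FALSE, head(is_create, -1))
bundle_start <- is_create & !prev_create

# Running count of start points gives a bundle id per row;
# rows without an operation belong to no bundle
bundle <- cumsum(bundle_start)
bundle[is.na(dat$operation)] <- NA

# Row index of the first row of each bundle
first_row <- match(bundle, bundle)

# New value: ID, dash, HHMMSS of the bundle's first row; NA outside bundles
dat$new_value <- ifelse(
  is.na(bundle),
  NA_character_,
  paste0(dat$ID, "-", format(dat$time[first_row], "%H%M%S", tz = "UTC"))
)

print(dat[, c("ID", "time", "new_value", "operation")])
```

Because hour, minute and second are all taken from the first row of the bundle, a bundle that straddles a minute boundary would get its whole suffix from the create row. Whether that is the desired behaviour is not stated in the question.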
We can do much of the work with functions from data.table. I renamed your data to dat instead of data.
Here is another approach using pivot_ and a self-join:
... %>%
  distinct() %>%
  left_join(data, by = c("ID", "value", "time" = "create")) %>%
  mutate(value = case_when(
    !is.na(value2) & !is.na(value) ~ value2,
    !is.na(lag(value2)) & !is.na(value) ~ lag(value2)
  )) %>%
  select(-value2)
#>    ID                time     value operation
#> 1  K1 2020-10-12 07:35:47 K1-073547    create
#> 2  K1 2020-10-12 07:35:49 K1-073547    upload
#> 3  K1 2020-10-12 07:35:50
#> 4  K1 2020-10-12 07:35:55 K1-073555    create
#> 5  K1 2020-10-12 07:35:58 K1-073555    upload
#> 6  K1 2020-10-12 07:37:19
#> 7  KK 2020-10-13 08:11:09 KK-081109    create
#> 8  KK 2020-10-13 08:11:09 KK-081109    create
#> 9  KK 2020-10-13 08:11:12 KK-081109    upload
#> 10 KK 2020-10-13 08:11:15
#> 11 KK 2020-10-13 08:11:25 KK-081125    create
#> 12 KK 2020-10-13 08:11:26 KK-081125    upload
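Created on 2021-01-14 by the reprex package (v0.3.0)
Quick question: is the value code always two letters, then a dash, then the time?
@Tom The value consists of the value of column ID, such as KK, a dash, and the hour and minute, concatenated without spaces. So if the ID in a row is K1 and the timestamp is 2020-10-12 07:35:47, then the value is K1-0735.
Is time an actual datetime variable or a string? Could you provide the sample data via dput? Thanks.
@starja I added the code that creates the data frame. Yes, time has date format.
What happens if create and upload fall into two different minutes? E.g., create is at 12:59:10 and upload is at 13:00:01?
Your output does not match the desired result. There are no index, min_time and value2 columns in it.
Wouldn't it be more efficient to transform the existing column rather than create a new one?
I originally left those columns in to show the calculation process. I have now removed them to get closer to your desired result. I think creating a new column gives a simpler solution whose programming breadcrumbs are easier to follow. You don't have to deal with any possible inefficiencies the existing column carries.
Creating the index from value with index = rleid(value) is wrong, because we can have two create-upload bundles within the same minute with different seconds, and they would get the same index.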
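The last objection can be illustrated with a small base R sketch. The four rows below are a hypothetical case (two bundles in the same minute with no empty row between them), and `run_id` and `bundle_id` are illustrative names. The run index is computed by hand here, and for this NA-free vector data.table's rleid(value) would number the runs the same way.

```r
# Hypothetical case: two create-upload bundles in the same minute,
# with no empty row separating them
value     <- c("K1-0735", "K1-0735", "K1-0735", "K1-0735")
operation <- c("create", "upload", "create", "upload")

# Run-length id over `value`: a new id starts whenever the value changes
run_id <- cumsum(c(TRUE, value[-1] != value[-length(value)]))

# Id driven by `operation` instead: a new bundle starts at each "create"
# that does not directly follow another "create"
is_create <- operation == "create"
bundle_id <- cumsum(is_create & !c(FALSE, head(is_create, -1)))

print(run_id)     # one run, so both bundles share an id
print(bundle_id)  # the two bundles get different ids
```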
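library(tidyverse)
library(data.table)
library(lubridate)

# `dat` is the question's data frame, with `time` stored as a date-time
dat %>%
  # run id: consecutive rows with the same value share one index
  mutate(index = rleid(value)) %>%
  group_by(index) %>%
  # earliest timestamp within each run
  mutate(min_time = min(time)) %>%
  # ID, dash, then zero-padded hour, minute and second of that timestamp
  mutate(value2 = paste0(ID, "-",
                         str_pad(hour(min_time), 2, pad = "0"),
                         str_pad(minute(min_time), 2, pad = "0"),
                         str_pad(second(min_time), 2, pad = "0"))) %>%
  # rows that had no value stay NA
  mutate(value2 = ifelse(is.na(value), NA, value2)) %>%
  ungroup() %>%
  select(ID, time, value = value2, operation)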
ID time value operation
<chr> <dttm> <chr> <chr>
1 K1 2020-10-12 07:35:47 K1-073547 create
2 K1 2020-10-12 07:35:49 K1-073547 upload
3 K1 2020-10-12 07:35:50 NA NA
4 K1 2020-10-12 07:35:55 K1-073555 create
5 K1 2020-10-12 07:35:58 K1-073555 upload
6 K1 2020-10-12 07:37:19 NA NA
7 KK 2020-10-13 08:11:09 KK-081109 create
8 KK 2020-10-13 08:11:09 KK-081109 create
9 KK 2020-10-13 08:11:12 KK-081109 upload
10 KK 2020-10-13 08:11:15 NA NA
11 KK 2020-10-13 08:11:25 KK-081125 create
12 KK 2020-10-13 08:11:26 KK-081125 upload
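# Revised version: group rows by the NA separators in `value`
# instead of by runs of equal values
dat %>%
  mutate(row_sep = cumsum(is.na(value))) %>%
  group_by(row_sep) %>%
  # earliest time among the non-NA rows of each group; ifelse() drops the
  # POSIXct class, so the result is converted back with as.POSIXct()
  # (tz = "UTC" assumes `time` is stored in UTC)
  mutate(min_time = min(as.POSIXct(ifelse(is.na(value), NA_POSIXct_, time),
                                   origin = "1970-01-01", tz = "UTC"),
                        na.rm = TRUE)) %>%
  mutate(value2 = paste0(ID, "-",
                         str_pad(hour(min_time), 2, pad = "0"),
                         str_pad(minute(min_time), 2, pad = "0"),
                         str_pad(second(min_time), 2, pad = "0"))) %>%
  # rows that had no value stay NA
  mutate(value2 = ifelse(is.na(value), NA, value2)) %>%
  ungroup() %>%
  select(ID, time, value = value2, operation)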
ID time value operation
<chr> <dttm> <chr> <chr>
1 K1 2020-10-12 07:35:47 K1-073547 create
2 K1 2020-10-12 07:35:49 K1-073547 upload
3 K1 2020-10-12 07:35:50 NA NA
4 K1 2020-10-12 07:35:55 K1-073555 create
5 K1 2020-10-12 07:35:58 K1-073555 upload
6 K1 2020-10-12 07:37:19 NA NA
7 KK 2020-10-13 08:11:09 KK-081109 create
8 KK 2020-10-13 08:11:09 KK-081109 create
9 KK 2020-10-13 08:11:12 KK-081109 upload
10 KK 2020-10-13 08:11:15 NA NA
11 KK 2020-10-13 08:11:25 KK-081125 create
12 KK 2020-10-13 08:11:26 KK-081125 upload
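The second pipeline wraps ifelse() in as.POSIXct(..., origin = ...), presumably because ifelse() drops attributes such as the POSIXct class and returns plain numbers. A minimal base R sketch of that behaviour, using two rows from the example, illustrative names (`masked`, `restored`) and an assumed UTC time zone:

```r
# Two timestamps from the example; the second row has no value
time  <- as.POSIXct(c("2020-10-12 07:35:47", "2020-10-12 07:35:50"), tz = "UTC")
value <- c("K1-0735", NA)

# ifelse() returns a bare numeric vector (seconds since 1970-01-01),
# not a POSIXct vector
masked <- ifelse(is.na(value), NA, time)
print(class(masked))

# Converting back restores a date-time that can be formatted again
restored <- as.POSIXct(masked, origin = "1970-01-01", tz = "UTC")
print(format(restored, "%H%M%S", tz = "UTC"))
```

An alternative that avoids the round trip is to subset before taking the minimum, for example min(time[!is.na(value)]), which keeps the class and time zone of the original vector.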