R: Quickly identifying discontinuities in timestamps (tags: r, dplyr)


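I have a large data frame with tens of thousands of timestamps spread across different files, as in the following example (a reproducible definition of df is at the end of the post).

How can I check for and identify erroneous timestamp values more quickly?

Possibly use diff()?

Most of your code seems to only convert the timestamp text into timestamps you can actually filter on. I suggest using purrr to collate all the files and then applying these processing steps to the combined data, so that you end up with a usable format for analysis (i.e. the final filtering step).
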
df
    line speaker                                      utterance                   timestamp file
493 0247  ID03.A                     ↑he's moving to↑ Bru: ges= 00:04:57.517 - 00:04:58.832  F03
495 0248    <NA>                                        (0.148) 00:04:58.832 - 00:04:58.980  F03
497 0249  ID03.B                                       =↑a[:h.] 00:04:58.980 - 00:04:59.860  F03
499 0250  ID03.A                    [have you been] to Bruges?= 00:04:59.322 - 00:05:00.529  F03
501 0251    <NA>                                        (0.023) 00:05:00.529 - 00:05:00.552  F03
503 0252  ID03.B           =that's cute [no (but I know of it)] 00:05:00.552 - 00:05:02.420  F03
505 0253  ID03.A                 [it's so cu:te] so cute #yeah# 00:05:01.350 - 00:05:03.260  F03
507 0254    <NA>                                        (0.320) 00:01:03.260 - 00:05:03.580  F03 
509 0255  ID03.A but u::m tt anyway yeah I was writing him a:nd 00:05:03.580 - 00:05:07.430  F03
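
Each timestamp cell packs a start and an end time in a fixed HH:MM:SS.mmm layout, so any check first needs both halves as numbers. Below is a minimal base R sketch of a vectorized conversion, assuming every cell is well formed and non-missing. The names to_ms and bounds are hypothetical and do not come from the post.

```r
# Hypothetical helper (not from the original post): convert "HH:MM:SS.mmm"
# strings to milliseconds without calling a function once per element.
# Assumes every value is non-missing and has exactly three ":"-separated parts.
to_ms <- function(x) {
  # strsplit() is vectorized; unlist() yields h, m, s, h, m, s, ...
  parts <- matrix(as.numeric(unlist(strsplit(x, ":", fixed = TRUE))),
                  ncol = 3, byrow = TRUE)
  # One matrix product replaces the per-row sum(c(3600, 60, 1) * x)
  as.vector(parts %*% c(3600, 60, 1)) * 1000
}

# Two cells copied from the sample data (lines 0247 and 0254)
timestamp <- c("00:04:57.517 - 00:04:58.832",
               "00:01:03.260 - 00:05:03.580")

# Split each cell once on the literal " - " separator
bounds <- do.call(rbind, strsplit(timestamp, " - ", fixed = TRUE))
starttime_ms <- to_ms(bounds[, 1])
endtime_ms   <- to_ms(bounds[, 2])
```

Whether this is noticeably faster than an sapply() loop depends on the data size, so it is worth timing on a real file before adopting it.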

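library(dplyr)    # group_by(), mutate(), filter(), lag()
library(stringr)  # str_extract()

df %>%
  group_by(file) %>%
  mutate(
    # split "start - end" into its two halves
    starttime = str_extract(timestamp, "^.*(?=\\s-)"),
    endtime = str_extract(timestamp, "(?<=- ).*$"),
    # convert HH:MM:SS.mmm to milliseconds
    starttime_ms = sapply(strsplit(starttime, ":"), function(x) 1000 * sum(c(3600,60,1) * as.numeric(x))),
    endtime_ms = sapply(strsplit(endtime, ":"), function(x) 1000 * sum(c(3600,60,1) * as.numeric(x))),
    duration = endtime_ms - starttime_ms) %>%
  # keep rows that start before the previous row started, or end before they start
  filter(starttime_ms < lag(starttime_ms) | endtime_ms < starttime_ms)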
# A tibble: 1 x 10
# Groups:   file [1]
  line  speaker utterance timestamp                   file  starttime    endtime      starttime_ms endtime_ms duration
  <chr> <chr>   <chr>     <chr>                       <chr> <chr>        <chr>               <dbl>      <dbl>    <dbl>
1 0254  NA      (0.320)   00:01:03.260 - 00:05:03.580 F03   00:01:03.260 00:05:03.580        63260     303580   240320
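
The diff() idea from the comments can be sketched in base R as follows. It expresses the same "starts earlier than the previous row" check as the lag() comparison, computed per file. The data frame d and the column names step and backwards are invented for the sketch; the millisecond values are the start times of lines 0252 to 0255 in the sample.

```r
# Start times in milliseconds for lines 0252-0255 of file F03
d <- data.frame(
  line = c("0252", "0253", "0254", "0255"),
  file = "F03",
  starttime_ms = c(300552, 301350, 63260, 303580)
)

# diff() within each file; the first row of a file has no predecessor,
# so it is padded with 0 and can never be flagged
step <- ave(d$starttime_ms, d$file, FUN = function(x) c(0, diff(x)))

# A negative step means the row starts earlier than the row before it
d$backwards <- step < 0
d[d$backwards, ]
```

This only covers the ordering of start times. Rows whose end time precedes their own start time still need the separate endtime_ms < starttime_ms test.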

df <- structure(list(line = c("0247", "0248", "0249", "0250", "0251", 
                              "0252", "0253", "0254", "0255"), 
                     speaker = c("ID03.A", NA, "ID03.B", "ID03.A", NA, "ID03.B", "ID03.A", NA, "ID03.A"), 
                     utterance = c("↑he's moving to↑ Bru: ges=","(0.148)",
                                   "=↑a[:h.]", 
                                   "[have you been] to Bruges?=", 
                                   "(0.023)",
                                   "=that's cute [no (but I know of it)]", 
                                   "[it's so cu:te] so cute #yeah#",
                                   "(0.320)", 
                                   "but u::m tt anyway yeah I was writing him a:nd"), 
                     timestamp = c("00:04:57.517 - 00:04:58.832", "00:04:58.832 - 00:04:58.980", 
                                   "00:04:58.980 - 00:04:59.860", "00:04:59.322 - 00:05:00.529", 
                                   "00:05:00.529 - 00:05:00.552", "00:05:00.552 - 00:05:02.420", 
                                   "00:05:01.350 - 00:05:03.260", "00:01:03.260 - 00:05:03.580", 
                                   "00:05:03.580 - 00:05:07.430"), 
                     file = c("F03", "F03", "F03", "F03", "F03", "F03", "F03", "F03", "F03")), 
                row.names = c(493L, 495L, 497L, 499L, 501L, 503L, 505L, 507L, 509L), class = "data.frame")
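
The purrr suggestion could look roughly like this. The post does not say how the files are stored, so the sketch assumes one CSV per transcript and creates two tiny stand-in files in a temporary directory. The paths, file names and contents are invented for illustration.

```r
library(purrr)  # map()
library(dplyr)  # bind_rows()

# Hypothetical stand-ins for the real transcript files
dir <- tempfile("transcripts")
dir.create(dir)
write.csv(data.frame(line = c("0001", "0002"),
                     timestamp = c("00:00:01.000 - 00:00:02.000",
                                   "00:00:02.000 - 00:00:03.500")),
          file.path(dir, "F01.csv"), row.names = FALSE)
write.csv(data.frame(line = "0001",
                     timestamp = "00:00:00.500 - 00:00:01.250"),
          file.path(dir, "F02.csv"), row.names = FALSE)

# Name each path after its file so the name can become the `file` column
paths <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
names(paths) <- tools::file_path_sans_ext(basename(paths))

combined <- paths %>%
  # read everything as character so line numbers keep their leading zeros
  map(function(p) read.csv(p, colClasses = "character")) %>%
  # stack the list into one data frame; list names go into `file`
  bind_rows(.id = "file")
```

With all files in one data frame, the parsing and filtering steps run once under group_by(file) instead of once per file.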