
R: Joining two datasets on the closest start time with an interval fuzzy join

Tags: r, dplyr, tidyverse, fuzzyjoin

I am trying to join two large datasets in R with "fuzzyjoin::interval_inner_join". My goal is to join these tables based on the closest start and end times.

# first dataset

viewing <- data.frame(stringsAsFactors=FALSE,
                 id = c("100-16", "100-16", "100-16", "100-16", "100-16",
                        "100-16", "100-16", "100-16", "100-16", "100-16"),
      start_utc_day = c("2019-05-01", "2019-05-01", "2019-05-01", "2019-05-01",
                        "2019-05-01", "2019-05-01", "2019-05-01", "2019-05-01",
                        "2019-05-01", "2019-05-01"),
   start = c("7:18:45", "7:27:30", "7:59:30", "8:37:30", "8:41:15",
                        "8:47:15", "8:52:45", "8:55:30", "8:57:45", "9:05:00"),
     end = c("7:26:15", "7:59:15", "8:33:45", "8:40:30", "8:43:15",
                        "8:50:15", "8:55:15", "8:57:00", "9:00:00", "9:07:00")
)

# second dataset
location <- data.frame(stringsAsFactors=FALSE,
                 id = c("100-16", "100-16", "100-16", "100-16", "100-16",
                        "100-16", "100-16", "100-16", "100-16", "100-16"),
               code = c("IN", "IN", "IN", "IN", "IN", "IN", "IN", "IN", "IN",
                        "IN"),
            utc_day = c("2019-05-01", "2019-05-01", "2019-05-01", "2019-05-01",
                        "2019-05-01", "2019-05-01", "2019-05-01", "2019-05-01",
                        "2019-05-01", "2019-05-01"),
   start = c("7:13:30", "7:17:00", "7:22:00", "7:41:00", "8:14:15",
                        "8:33:45", "8:43:00", "9:08:45", "9:21:15", "9:32:00"),
     end = c("7:15:30", "7:20:30", "7:31:00", "7:43:00", "8:15:45",
                        "8:35:15", "8:45:30", "9:12:15", "9:23:00", "9:35:15")
)
But I get this error:

Joining by: c("id", "start", "end")
Error in index_match_fun(d1, d2) :
  interval_join must join on exactly two columns (start and end)


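Consider creating proper date/time fields and then running the fuzzy join. Character columns cannot be used for numeric interval comparison or matching.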
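To see why the original call fails, it can help to check which columns the two tables share and what type they are. The "Joining by" message suggests that, with no by argument, the join fell back to every shared column name, which here is three columns rather than the two an interval join expects. A minimal base-R sketch of that check, using one-row stand-ins for the two data frames (the stand-ins are illustrative, not the full data):

```r
# One-row stand-ins for the question's data frames (illustrative only)
viewing <- data.frame(id = "100-16", start_utc_day = "2019-05-01",
                      start = "7:18:45", end = "7:26:15",
                      stringsAsFactors = FALSE)
location <- data.frame(id = "100-16", code = "IN", utc_day = "2019-05-01",
                       start = "7:17:00", end = "7:20:30",
                       stringsAsFactors = FALSE)

# Column names present in both tables: the fallback when `by` is not given
common_cols <- intersect(names(viewing), names(location))
print(common_cols)

# Type of each shared column; character columns cannot act as numeric intervals
col_classes <- vapply(viewing[common_cols], class, character(1))
print(col_classes)
```

Passing exactly two date-time columns through by, as in the code that follows, avoids both problems.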

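library(fuzzyjoin)  # the interval joins also need the Bioconductor package IRanges

# Build proper date-time columns from the day and time strings (the days are UTC)
viewing <- within(viewing, {
   end_dt_time <- as.POSIXct(paste(start_utc_day, end), format="%Y-%m-%d %H:%M:%S", tz="UTC")
   start_dt_time <- as.POSIXct(paste(start_utc_day, start), format="%Y-%m-%d %H:%M:%S", tz="UTC")
})

location <- within(location, {
   end_dt_time <- as.POSIXct(paste(utc_day, end), format="%Y-%m-%d %H:%M:%S", tz="UTC")
   start_dt_time <- as.POSIXct(paste(utc_day, start), format="%Y-%m-%d %H:%M:%S", tz="UTC")
})

# Keep the viewing rows whose interval overlaps some location interval
interval_semi_join(viewing, location, by=c("start_dt_time", "end_dt_time"), minoverlap=3)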
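A semi join returns only the columns of its first table, so code from location is not carried over. If columns from both tables are needed, an inner-style overlap join is one option. The following is a base-R sketch of that idea, not fuzzyjoin's implementation: it pairs rows by id with merge and keeps the pairs whose intervals overlap by at least a given number of seconds. The names overlap_inner_join, to_time, and min_overlap_secs are hypothetical, and the data is a small subset of the question's values.

```r
# Hypothetical helper: inner-style overlap join in base R (not fuzzyjoin's code)
overlap_inner_join <- function(x, y, min_overlap_secs = 0) {
  # Pair every x row with every y row that has the same id
  pairs <- merge(x, y, by = "id", suffixes = c(".x", ".y"))
  # Overlap in seconds: earliest end minus latest start (negative if disjoint)
  pairs$overlap_secs <-
    pmin(as.numeric(pairs$end_dt_time.x), as.numeric(pairs$end_dt_time.y)) -
    pmax(as.numeric(pairs$start_dt_time.x), as.numeric(pairs$start_dt_time.y))
  pairs[pairs$overlap_secs >= min_overlap_secs, , drop = FALSE]
}

# Hypothetical helper: combine a day string and time strings into POSIXct (UTC)
to_time <- function(day, time) {
  as.POSIXct(paste(day, time), format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
}

# Small subset of the question's data
viewing_small <- data.frame(
  id = "100-16",
  start_dt_time = to_time("2019-05-01", c("7:18:45", "7:27:30", "9:05:00")),
  end_dt_time   = to_time("2019-05-01", c("7:26:15", "7:59:15", "9:07:00")),
  stringsAsFactors = FALSE
)
location_small <- data.frame(
  id = "100-16",
  code = "IN",
  start_dt_time = to_time("2019-05-01", c("7:17:00", "7:22:00", "7:41:00")),
  end_dt_time   = to_time("2019-05-01", c("7:20:30", "7:31:00", "7:43:00")),
  stringsAsFactors = FALSE
)

joined <- overlap_inner_join(viewing_small, location_small, min_overlap_secs = 3)
print(joined[, c("id", "code", "start_dt_time.x", "start_dt_time.y", "overlap_secs")])
```

The pairwise merge grows with the number of rows per id, so on large tables this sketch can become expensive. To keep only the closest match per viewing row, the result could then be reduced per row, for example by the largest overlap_secs.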

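Looking at your start and end, they are currently character columns, which makes interval matching difficult.

Thanks. My original start and end columns were a different type (they turned into character when I exported the dummy data), so the fuzzy join only works on certain variable types, right?

Did the above solution work for you? The documentation seems to indicate that by can be any numeric, including date/time (which under the hood is seconds since the epoch, i.e. Unix time).

No, it didn't, because it seems to drop the id and code variables.

Alternatively, you could round each time to the nearest interval, e.g. 3 minutes, 6 minutes, etc., and merge on id and timestamp.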
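The rounding idea from the last comment could be sketched as follows. This is only an illustration under assumptions: the helper round_to and the 180-second bucket are hypothetical choices, and the two-row data frames reuse a few start times from the question.

```r
# Hypothetical helper: round POSIXct values to the nearest multiple of `secs` seconds
round_to <- function(t, secs = 180) {
  as.POSIXct(round(as.numeric(t) / secs) * secs, origin = "1970-01-01", tz = "UTC")
}

viewing_small <- data.frame(
  id = "100-16",
  start_dt_time = as.POSIXct(c("2019-05-01 07:18:45", "2019-05-01 08:41:15"), tz = "UTC"),
  stringsAsFactors = FALSE
)
location_small <- data.frame(
  id = "100-16",
  code = "IN",
  start_dt_time = as.POSIXct(c("2019-05-01 07:17:00", "2019-05-01 08:43:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Add the rounded timestamp to both tables, then merge on id plus that key
viewing_small$start_rounded <- round_to(viewing_small$start_dt_time)
location_small$start_rounded <- round_to(location_small$start_dt_time)

merged <- merge(viewing_small, location_small, by = c("id", "start_rounded"),
                suffixes = c(".viewing", ".location"))
print(merged)
```

Unlike the semi join, this keeps id and code. The trade-off is that two times only a few seconds apart can still round to different buckets and fail to match, so the bucket width needs to suit the data.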