How can I use the spread() and gather() functions in R to restructure a given booked-trips dataset into the desired linked-trips dataset?
I have a booked-trips dataset, as shown below:
bktrips <- data.frame(
userID =c("P001", "P001", "P001", "P001", "P001", "P002", "P002", "P002", "P002"),
mode = c("bus", "train", "taxi", "bus", "train", "taxi","bus", "train", "taxi"),
Origin = c("O1", "O2", "O3", "O4", "O5", "O6", "O7", "O8", "O9"),
Destination = c("D1", "D2", "D3", "D4", "D5", "D6", "D7","D8", "D9" ),
depart_dt = c("2019-11-05 8:00:00","2019-11-05 8:30:00", "2019-11-05 11:00:00", "2019-11-05 11:40:00", "2019-11-06 8:00:00", "2019-11-06 9:10:00", "2019-11-07 8:00:00", "2019-11-08 8:00:00", "2019-11-08 8:50:00"),
Olat = c("-33.87085", "-33.87138", "-33.79504", "-33.87832", "-33.89158", "-33.88993", "-33.89173", "-33.88573", "-33.88505"),
Olon = c("151.2073", "151.2039", "151.2737", "151.2174","151.2485", "151.2805","151.2469", "151.2169","151.2156"),
Dlat = c("-33.87372", "-33.87384", "-33.88323", "-33.89165", "-33.88993", "-33.89177", "-33.88573", "-33.87731", "-33.88573"),
Dlon = c("151.1957", "151.2126", "151.2175", "151.2471","151.2471", "151.2805","151.2514", "151.2175","151.2169")
)
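bktrips
Here is an approach using dplyr and geosphere to calculate the distances. I use lubridate to fix your date column.
First, we fix the classes of the columns. Second, we rely on the fact that trips must occur in chronological order. We therefore use lag from dplyr and distHaversine from geosphere to calculate the distance from the previous destination and the time since the previous departure.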
library(dplyr)
library(geosphere)
library(lubridate)
bktrips %>%
mutate(depart_dt = ymd_hms(depart_dt)) %>%
mutate_at(vars(contains(c("lat","lon"))),list(~as.numeric(as.character(.)))) %>%
group_by(userID) %>%
arrange(depart_dt,.by_group = TRUE) %>%
mutate(DistPrevDest = distHaversine(cbind(Olon,Olat),cbind(lag(Dlon),lag(Dlat))),
TimePrevDep = difftime(depart_dt,lag(depart_dt))) %>%
dplyr::select(-depart_dt,-contains(c("lat","lon")))
userID mode Origin Destination DistPrevDest TimePrevDep
<fct> <fct> <fct> <fct> <dbl> <drtn>
1 P001 bus O1 D1 NA NA mins
2 P001 train O2 D2 801. 30 mins
3 P001 taxi O3 D3 10434. 150 mins
4 P001 bus O4 D4 547. 40 mins
5 P001 train O5 D5 130. 1220 mins
6 P002 taxi O6 D6 NA NA mins
7 P002 bus O7 D7 3105. 1370 mins
8 P002 train O8 D8 3188. 1440 mins
9 P002 taxi O9 D9 879. 50 mins
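I would suggest also including arrival times in your data, and then calculating the difference between each departure time and the previous arrival time instead.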
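A minimal sketch of that suggestion, assuming the data had an arrival-time column. The column name arrive_dt and all of its values are hypothetical (the original dataset has no arrival times), so this only illustrates the shape of the calculation:

```r
library(dplyr)

# Hypothetical data: arrive_dt is NOT in the original dataset;
# its values are made up purely for illustration.
trips <- data.frame(
  userID = c("P001", "P001", "P001", "P002", "P002"),
  depart_dt = as.POSIXct(c("2019-11-05 08:00:00", "2019-11-05 08:30:00",
                           "2019-11-05 11:00:00", "2019-11-06 09:10:00",
                           "2019-11-07 08:00:00"), tz = "UTC"),
  arrive_dt = as.POSIXct(c("2019-11-05 08:20:00", "2019-11-05 09:00:00",
                           "2019-11-05 11:30:00", "2019-11-06 09:40:00",
                           "2019-11-07 08:25:00"), tz = "UTC")
)

gaps <- trips %>%
  group_by(userID) %>%
  arrange(depart_dt, .by_group = TRUE) %>%
  # Minutes between this departure and the previous arrival of the same user;
  # the first trip of each user has no previous arrival, so it is NA.
  mutate(TimePrevArr = as.numeric(difftime(depart_dt, lag(arrive_dt),
                                           units = "mins"))) %>%
  ungroup()

gaps
```

Because lag() is applied inside group_by(userID), the first trip of each user gets NA rather than borrowing the last arrival of the previous user.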
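Edit: A cumsum() was missing; that is fixed now. Also, rleid is no longer needed.
It's not clear to me where you want to go with this, but here is a start that calculates the distance and time of each trip per group (by userID). I had to quickly find a package to calculate distances on the earth from latitude/longitude and found geosphere.
Hope this helps.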
library(dplyr)
library(tibble)
library(geosphere)
bktrips <- tibble(
userID =c("P001", "P001", "P001", "P001", "P001", "P002", "P002", "P002", "P002"),
mode = c("bus", "train", "taxi", "bus", "train", "taxi","bus", "train", "taxi"),
Origin = c("O1", "O2", "O3", "O4", "O5", "O6", "O7", "O8", "O9"),
Destination = c("D1", "D2", "D3", "D4", "D5", "D6", "D7","D8", "D9" ),
depart_dt = c("2019-11-05 8:00:00","2019-11-05 8:30:00", "2019-11-05 11:00:00", "2019-11-05 11:40:00", "2019-11-06 8:00:00", "2019-11-06 9:10:00", "2019-11-07 8:00:00", "2019-11-08 8:00:00", "2019-11-08 8:50:00"),
Olat = c("-33.87085", "-33.87138", "-33.79504", "-33.87832", "-33.89158", "-33.88993", "-33.89173", "-33.88573", "-33.88505"),
Olon = c("151.2073", "151.2039", "151.2737", "151.2174","151.2485", "151.2805","151.2469", "151.2169","151.2156"),
Dlat = c("-33.87372", "-33.87384", "-33.88323", "-33.89165", "-33.88993", "-33.89177", "-33.88573", "-33.87731", "-33.88573"),
Dlon = c("151.1957", "151.2126", "151.2175", "151.2471","151.2471", "151.2805","151.2514", "151.2175","151.2169")
)
bktrips <- bktrips %>%
mutate(depart_dt = as.POSIXct(depart_dt, format = "%Y-%m-%d %H:%M:%S"),
Olat = as.numeric(Olat),
Olon = as.numeric(Olon),
Dlat = as.numeric(Dlat),
Dlon = as.numeric(Dlon)) %>%
group_by(userID) %>%
mutate(trip_time = as.numeric(depart_dt - lag(depart_dt), units = 'mins')) %>%
rowwise() %>%
mutate(trip_distance = distm(x = c(Olon, Olat), y = c(Dlon, Dlat), fun = distHaversine))
> bktrips
Source: local data frame [9 x 11]
Groups: <by row>
# A tibble: 9 x 11
userID mode Origin Destination depart_dt Olat Olon Dlat Dlon trip_time trip_distance
<chr> <chr> <chr> <chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 P001 bus O1 D1 2019-11-05 08:00:00 -33.9 151. -33.9 151. NA 1119.
2 P001 train O2 D2 2019-11-05 08:30:00 -33.9 151. -33.9 151. 30 849.
3 P001 taxi O3 D3 2019-11-05 11:00:00 -33.8 151. -33.9 151. 150 11108.
4 P001 bus O4 D4 2019-11-05 11:40:00 -33.9 151. -33.9 151. 40 3120.
5 P001 train O5 D5 2019-11-06 08:00:00 -33.9 151. -33.9 151. 1220 225.
6 P002 taxi O6 D6 2019-11-06 09:10:00 -33.9 151. -33.9 151. NA 205.
7 P002 bus O7 D7 2019-11-07 08:00:00 -33.9 151. -33.9 151. 1370 787.
8 P002 train O8 D8 2019-11-08 08:00:00 -33.9 151. -33.9 151. 1440 939.
9 P002 taxi O9 D9 2019-11-08 08:50:00 -33.9 151. -33.9 151. 50 142.
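For readers who want to see what the distance column represents, the haversine (great-circle) formula can be sketched by hand in base R. This is a hand-written illustration, not geosphere's source code; the function name haversine_m is made up here, and the radius of 6378137 m is chosen to match the default of geosphere::distHaversine. For the first trip (O1 to D1) it should come out close to the 1119 m shown in row 1 above.

```r
# Hand-written haversine distance in metres; inputs are in degrees.
# haversine_m is a hypothetical helper name, not part of geosphere.
# r = 6378137 matches the default radius of geosphere::distHaversine.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * r * asin(pmin(1, sqrt(a)))
}

# Trip 1 of P001: O1 (lon 151.2073, lat -33.87085) to D1 (lon 151.1957, lat -33.87372)
d <- haversine_m(151.2073, -33.87085, 151.1957, -33.87372)
round(d)
```

The function is vectorised, so it can also be called on whole columns (Olon, Olat, Dlon, Dlat) without rowwise().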
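Thanks, Ben, for the excellent edit.
Can you help me with this problem? Where is the end point (arrival date/time) of the previous trip?
Thank you, Edward. In my actual data, most of the arrival times are missing.
Dear Ian, thank you very much for your great work. What you have done here is correct and matches what I expected for this problem. Also, in my actual dataset most of the arrival times are missing, which is why I need to work with departure times only. Much love.
Glad it works for you! As a bit of feedback, I spent a long time fighting wrong results from distHaversine because the lat and long values were factors and were being incorrectly coerced to integers. In future, please try to use dput(bktrips) to provide sample data whose columns already have the correct classes.
Thank you very much, Paul, for the advice and support.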
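The factor pitfall mentioned in the comment above can be reproduced in base R. This is a small sketch using three of the Olat values from the question; the variable names are illustrative only:

```r
# Latitudes stored as a factor, as data.frame() produced by default
# before R 4.0 when given character vectors.
lat <- factor(c("-33.87085", "-33.87138", "-33.79504"))

# Converting a factor directly returns the internal level codes
# (small integers), not the coordinates.
codes <- as.numeric(lat)

# Going through character first recovers the actual numbers.
values <- as.numeric(as.character(lat))

codes
values
```

This is why the first answer wraps the conversion in as.numeric(as.character(.)) before calling distHaversine.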