R: join to pull the minimum date from table Y that is greater than the date in table X, by item ID
I have found answers to many similar questions, but not to this exact one. I feel like this should be easy, but it is making my brain hurt.
The schema is that I have a table of rentals and a table of returns. Each item can be rented and returned multiple times. Fortunately, these are stored in separate tables.
Table X (rentals):
ID Type Date Rented
0001 A 2017-02-01
0001 A 2017-07-01
0001 A 2017-09-01
0002 B 2017-01-01
0002 B 2017-05-01
Table Y (returns):
ID Date Returned
0001 2017-05-01
0001 2017-08-01
0002 2017-04-01
I would like to end up with the following:
ID Type Date Rented Date Returned
0001 A 2017-02-01 2017-05-01
0001 A 2017-07-01 2017-08-01
0001 A 2017-09-01 NA
0002 B 2017-01-01 2017-04-01
0002 B 2017-05-01 NA
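So for each ID and rental date, I am looking for the minimum date in the returns table that is greater than that rental date.
I will be processing the output in R, so if this is easier to do in R/dplyr than in SQL, I am all ears.
One answer: merge the two tables, set the return date to NA wherever it is earlier than the rental date, then group by rental and take the minimum return date (this is the left_join() and summarise() pipeline shown after the row-number approach below).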
df1 = read.table(text = "
ID Type DateRented
0001 A 2017-02-01
0001 A 2017-07-01
0001 A 2017-09-01
0002 B 2017-01-01
0002 B 2017-05-01
", header=T)
df2 = read.table(text = "
ID DateRented
0001 2017-05-01
0001 2017-08-01
0002 2017-04-01
", header=T)
library(dplyr)
library(lubridate)
# convert to Date (the values are year-month-day) and order by ID and date
# (not needed if the columns are already dates in ascending order)
df1 = df1 %>% mutate(DateRented = ymd(DateRented)) %>% arrange(ID, DateRented)
df2 = df2 %>% mutate(DateRented = ymd(DateRented)) %>% arrange(ID, DateRented)
# add row ids for each ID to your datasets
df1 = df1 %>% group_by(ID) %>% mutate(row_id = row_number()) %>% ungroup()
df2 = df2 %>% group_by(ID) %>% mutate(row_id = row_number()) %>% ungroup()
# join datasets and remove row id column
left_join(df1, df2, by=c("ID","row_id")) %>% select(-row_id)
# # A tibble: 5 x 4
# ID Type DateRented.x DateRented.y
# <int> <fctr> <date> <date>
# 1 1 A 2017-02-01 2017-05-01
# 2 1 A 2017-07-01 2017-08-01
# 3 1 A 2017-09-01 NA
# 4 2 B 2017-01-01 2017-04-01
# 5 2 B 2017-05-01 NA
library(dplyr)
left_join(x, y, by = "ID") %>%
  mutate(DateReturned = if_else(DateReturned < DateRented, as.Date(NA), DateReturned)) %>%
  group_by(ID, Type, DateRented) %>%
  summarise(DateReturnedMin = min(DateReturned, na.rm = TRUE)) %>%
  ungroup()
# # A tibble: 5 x 4
# ID Type DateRented DateReturnedMin
# <int> <chr> <date> <date>
# 1 1 A 2017-02-01 2017-05-01
# 2 1 A 2017-07-01 2017-08-01
# 3 1 A 2017-09-01 NA
# 4 2 B 2017-01-01 2017-04-01
# 5 2 B 2017-05-01 NA
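One caveat with the pipeline above: when every return in a group has been set to NA, min(..., na.rm = TRUE) has nothing left to work with. It emits a warning and can yield an infinite date that prints as NA but is not caught by is.na(). The following is a minimal sketch of one way around that, not part of the original answer. It rebuilds the same x and y inline so it runs on its own, and it keeps returns on or after the rental date, as the original condition does.

```r
library(dplyr)

# same example data as in the thread, rebuilt inline so the sketch is self-contained
x <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L),
  Type = c("A", "A", "A", "B", "B"),
  DateRented = as.Date(c("2017-02-01", "2017-07-01", "2017-09-01",
                         "2017-01-01", "2017-05-01"))
)
y <- data.frame(
  ID = c(1L, 1L, 2L),
  DateReturned = as.Date(c("2017-05-01", "2017-08-01", "2017-04-01"))
)

# newer dplyr versions may warn about a many-to-many join here; that is expected,
# because every rental is paired with every return of the same ID
res <- left_join(x, y, by = "ID") %>%
  group_by(ID, Type, DateRented) %>%
  summarise(
    DateReturnedMin = {
      # returns that are not earlier than this rental
      later <- DateReturned[!is.na(DateReturned) & DateReturned >= DateRented]
      # take the earliest one, or a genuine NA date when none exists
      if (length(later) > 0) min(later) else as.Date(NA)
    },
    .groups = "drop"
  )
res
```

With this variant, rentals without a later return carry a real NA, so a downstream filter(is.na(DateReturnedMin)) should pick up the open rentals.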
Alternatively, if we prefer SQL, we can use the sqldf package; the logic is the same as above:
library(sqldf)
sqldf("select ID, Type, DateRented__Date, min(DateReturned__Date) as DateReturnedMin__Date
from (
select x.ID, Type, DateRented as DateRented__Date,
(case when (DateReturned < DateRented)
then NULL
else DateReturned
end) as DateReturned__Date
from x, y
where x.ID = y.ID) a
group by ID, Type, DateRented__Date",
method = "name__class")
# ID Type DateRented DateReturnedMin
# 1 1 A 2017-02-01 2017-05-01
# 2 1 A 2017-07-01 2017-08-01
# 3 1 A 2017-09-01 <NA>
# 4 2 B 2017-01-01 2017-04-01
# 5 2 B 2017-05-01 <NA>
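Comments:
I think you need to provide a more representative example, with more IDs. I think that if both datasets are ordered by date, you can join on the row number within each ID.
Edited to add a second ID, thanks.
Not sure why the second dataset has "Rented" in its column name when it represents returns.
I really like the SQL approach. It seems like the best way to go when pulling the data directly from the database, before it sits in two data frames. In this case I do not have that option.
Note: this breaks if an item's return occurred before its rental. It looks like both tables were extracted using the same cutoff date, so some items have a rental date before the cutoff and a return date after it. In those cases the rows do not align.
The best way to resolve this is to post an example of such a case and specify how you want it handled.
We can also use a join with data.table. The data used by the dplyr and sqldf code above comes first, followed by the data.table join: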
x <- read.table(text = "
ID Type DateRented
0001 A 2017-02-01
0001 A 2017-07-01
0001 A 2017-09-01
0002 B 2017-01-01
0002 B 2017-05-01", header = TRUE, stringsAsFactors = FALSE)
y <- read.table(text = "
ID DateReturned
0001 2017-05-01
0001 2017-08-01
0002 2017-04-01", header = TRUE, stringsAsFactors = FALSE)
# convert to date class
x$DateRented <- as.Date(x$DateRented, format = "%Y-%m-%d")
y$DateReturned <- as.Date(y$DateReturned, format = "%Y-%m-%d")
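library(data.table)
# x and y are the data frames defined above; work on data.table copies
X <- as.data.table(x)
Y <- as.data.table(y)
# for each return in Y, update the latest rental in X (same ID) that precedes it
X[Y, DateReturned := i.DateReturned, on = .(ID, DateRented < DateReturned), mult = "last"]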
X
# ID Type DateRented DateReturned
#1: 1 A 2017-02-01 2017-05-01
#2: 1 A 2017-07-01 2017-08-01
#3: 1 A 2017-09-01 <NA>
#4: 2 B 2017-01-01 2017-04-01
#5: 2 B 2017-05-01 <NA>
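The comments mention that matching by row number breaks when a return predates the first rental in the extract. As a hedged sketch of another option, which none of the answers use, dplyr 1.1.0 and later provide join_by() with closest(), which pairs each rental with the nearest later return directly. The extra return dated 2016-12-15 for ID 2 is a hypothetical row added here only to illustrate that cutoff case.

```r
library(dplyr)  # join_by() and closest() require dplyr >= 1.1.0

x <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L),
  Type = c("A", "A", "A", "B", "B"),
  DateRented = as.Date(c("2017-02-01", "2017-07-01", "2017-09-01",
                         "2017-01-01", "2017-05-01"))
)
# hypothetical: ID 2 has a return (2016-12-15) that predates its first rental
y <- data.frame(
  ID = c(1L, 1L, 2L, 2L),
  DateReturned = as.Date(c("2017-05-01", "2017-08-01",
                           "2016-12-15", "2017-04-01"))
)

# for each rental, keep only the closest return with the same ID
# whose date is strictly greater than the rental date
res <- left_join(x, y, join_by(ID, closest(DateRented < DateReturned)))
res
```

Because the match is made on the dates themselves rather than on position, the early return is ignored and the two open rentals keep NA. Note that this rule would assign the same return to two rentals if two rentals had no return between them, so it is worth checking whether that can happen in the real data.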