R join: for each item ID, get the minimum date in table Y that is greater than the date in table X


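I've found plenty of answers to similar questions, but none to this exact one. I feel like this should be easy, but it's making my brain hurt.

The setup is that I have a table of rentals and a table of returns. Each item can be rented and returned multiple times. Fortunately, these are stored in separate tables.

Table X (rentals)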

ID    Type   Date Rented
0001  A       2017-02-01
0001  A       2017-07-01
0001  A       2017-09-01
0002  B       2017-01-01
0002  B       2017-05-01

Table Y (returns)

ID    Date Returned
0001  2017-05-01
0001  2017-08-01
0002  2017-04-01

I'd like to end up with the following:

ID    Type    Date Rented    Date Returned
0001  A       2017-02-01     2017-05-01
0001  A       2017-07-01     2017-08-01
0001  A       2017-09-01     NA
0002  B       2017-01-01     2017-04-01
0002  B       2017-05-01     NA

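So for each ID and rental date, I'm looking for the minimum date in the returns table that is greater than that rental date.

I'll be working with the output in R, so if this is easier to do in R/dplyr than in SQL, I'm all ears...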

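Merge the two tables, set the return date to NA wherever it is earlier than the rental date, then group and take the minimum return date. (This describes the `left_join` pipeline further down; the first block below instead pairs rentals and returns by row number within each ID.)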

df1 = read.table(text = "
ID    Type   DateRented
0001  A       2017-02-01
0001  A       2017-07-01
0001  A       2017-09-01
0002  B       2017-01-01
0002  B       2017-05-01
", header=T)

df2 = read.table(text = "
ID    DateRented
0001  2017-05-01
0001  2017-08-01
0002  2017-04-01
", header=T)

library(dplyr)
library(lubridate)

# update to a date format and order by ID and date
# (not needed if you have already a date format and ascending order)
# the dates are in year-month-day order, so parse them with ymd()
df1 = df1 %>% mutate(DateRented = ymd(DateRented)) %>% arrange(ID, DateRented)
df2 = df2 %>% mutate(DateRented = ymd(DateRented)) %>% arrange(ID, DateRented)

# add row ids for each ID to your datasets
df1 = df1 %>% group_by(ID) %>% mutate(row_id = row_number()) %>% ungroup()
df2 = df2 %>% group_by(ID) %>% mutate(row_id = row_number()) %>% ungroup()

# join datasets and remove row id column
left_join(df1, df2, by=c("ID","row_id")) %>% select(-row_id)

# # A tibble: 5 x 4
#    ID Type   DateRented.x DateRented.y
#   <int> <fctr> <date>       <date>     
# 1     1 A      2017-02-01   2017-05-01  
# 2     1 A      2017-07-01   2017-08-01  
# 3     1 A      2017-09-01   NA          
# 4     2 B      2017-01-01   2017-04-01  
# 5     2 B      2017-05-01   NA

library(dplyr)

# x and y are the data frames built in the data section further down
left_join(x, y, by = "ID") %>% 
  mutate(DateReturned = if_else(DateReturned < DateRented, as.Date(NA), DateReturned)) %>% 
  group_by(ID, Type, DateRented) %>% 
  summarise(DateReturnedMin = min(DateReturned, na.rm = TRUE)) %>% 
  ungroup()
  
# # A tibble: 5 x 4
#      ID  Type DateRented DateReturnedMin
#   <int> <chr>     <date>          <date>
# 1     1     A 2017-02-01      2017-05-01
# 2     1     A 2017-07-01      2017-08-01
# 3     1     A 2017-09-01              NA
# 4     2     B 2017-01-01      2017-04-01
# 5     2     B 2017-05-01              NA
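
The same rule (for each rental, the smallest return date that is strictly later) can probably also be written as a single inequality join. The following is a minimal sketch, assuming dplyr 1.1.0 or later, where `join_by()` and its `closest()` helper are available. The names `rentals`, `returns`, and `result` are illustrative and not from the original answers; the data is the question's sample data retyped.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0, which provides join_by() and closest()

# Sample data from the question (illustrative object names)
rentals <- tibble(
  ID = c(1L, 1L, 1L, 2L, 2L),
  Type = c("A", "A", "A", "B", "B"),
  DateRented = as.Date(c("2017-02-01", "2017-07-01", "2017-09-01",
                         "2017-01-01", "2017-05-01"))
)
returns <- tibble(
  ID = c(1L, 1L, 2L),
  DateReturned = as.Date(c("2017-05-01", "2017-08-01", "2017-04-01"))
)

# Match on ID, and for each rental keep only the nearest return
# whose date is strictly later than the rental date.
# Rentals with no later return keep DateReturned = NA.
result <- left_join(
  rentals, returns,
  by = join_by(ID, closest(DateRented < DateReturned))
)

print(result)
```

Because the condition is evaluated per rental row, this does not rely on the two tables being sorted or having aligned row counts.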

Alternatively, if we prefer SQL, we can use the sqldf package; the logic is the same as in the dplyr `left_join` pipeline above:

library(sqldf)

sqldf("select ID, Type, DateRented__Date, min(DateReturned__Date) as DateReturnedMin__Date
       from (
             select x.ID, Type, DateRented as DateRented__Date,
                   (case when (DateReturned < DateRented)         
                         then NULL 
                         else DateReturned
                    end) as DateReturned__Date
             from x, y
             where x.ID = y.ID) a
       group by ID, Type, DateRented__Date",
      method = "name__class")

# ID   Type DateRented DateReturnedMin
# 1  1    A 2017-02-01      2017-05-01
# 2  1    A 2017-07-01      2017-08-01
# 3  1    A 2017-09-01            <NA>
# 4  2    B 2017-01-01      2017-04-01
# 5  2    B 2017-05-01            <NA>

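We can use a non-equi join with data.table (the code follows the data section below).

I think you need to provide a more representative example, with more IDs. I think that if both datasets are sorted by date, you can join using the row number within each ID.

Edited to add a second ID, thanks.

Not sure why the second dataset has "rented" in its column name when it represents returns.

I really like the SQL approach. It seems like the best way to pull the data straight from the database when I don't already have it in two data frames. In this case, I don't have that opportunity.

Note: this breaks if an item's return occurred before its rental. It looks like both tables were extracted using the same cutoff date, so some items have a rental date before the cutoff and a return date after it. In those cases the rows don't line up.

The best way to resolve this is to post an example like that and specify how you want it handled.

Data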

x <- read.table(text = "
ID    Type   DateRented
0001  A       2017-02-01
0001  A       2017-07-01
0001  A       2017-09-01
0002  B       2017-01-01
0002  B       2017-05-01", header = TRUE, stringsAsFactors = FALSE)

y <- read.table(text = "
ID    DateReturned
0001  2017-05-01
0001  2017-08-01
0002  2017-04-01", header = TRUE, stringsAsFactors = FALSE)

# convert to date class
x$DateRented <- as.Date(x$DateRented, format = "%Y-%m-%d")
y$DateReturned <- as.Date(y$DateReturned, format = "%Y-%m-%d")

library(data.table)
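# Non-equi update join: for each return in y, fill DateReturned on the last
# rental in x (same ID) that started before it. Assumes x is sorted by ID and
# DateRented. The data frames are named x and y (lowercase) above.
setDT(x)[y, DateReturned := i.DateReturned, on = .(ID, DateRented < DateReturned), mult = "last"]
x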
#   ID Type DateRented DateReturned
#1:  1    A 2017-02-01   2017-05-01
#2:  1    A 2017-07-01   2017-08-01
#3:  1    A 2017-09-01         <NA>
#4:  2    B 2017-01-01   2017-04-01
#5:  2    B 2017-05-01         <NA>
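
One comment above notes that pairing rows by position breaks when a return precedes the first rental for an ID. As a rough illustration of that case, here is a base R sketch that applies the stated rule directly (smallest return date strictly later than the rental date, per ID). ID 3 and its dates are hypothetical, added only to show the edge case; the object names are illustrative as well.

```r
# Rentals: IDs 1 and 2 follow the question's data; ID 3 is a hypothetical edge case
rentals <- data.frame(
  ID = c(1, 1, 2, 3),
  DateRented = as.Date(c("2017-02-01", "2017-07-01", "2017-01-01", "2017-03-01"))
)
# Returns: the return for ID 3 happens BEFORE its only rental in the extract
returns <- data.frame(
  ID = c(1, 1, 2, 3),
  DateReturned = as.Date(c("2017-05-01", "2017-08-01", "2017-04-01", "2017-01-15"))
)

# Start with NA and fill in the earliest later return, if one exists
rentals$DateReturned <- as.Date(NA)
for (i in seq_len(nrow(rentals))) {
  later <- returns$DateReturned[
    returns$ID == rentals$ID[i] &
      returns$DateReturned > rentals$DateRented[i]
  ]
  if (length(later) > 0) {
    rentals$DateReturned[i] <- min(later)
  }
}

print(rentals)
```

Under these assumptions the rental for ID 3 stays NA, whereas a join on per-ID row numbers would pair it with the earlier 2017-01-15 return. The loop is quadratic in the worst case, so for large tables one of the join-based approaches is likely the better choice.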