SQL-style dplyr left_join with less-than / greater-than conditions
Tags: sql, r, postgresql, left-join, dplyr

This question is somewhat related to other questions and their solutions. I have also posted an issue asking whether the feature exists.

I want to join two data frames using dplyr::left_join(). The join conditions are less-than and greater-than comparisons. Does dplyr::left_join() support this, or can the keys only be compared with the = operator? This is straightforward to run in SQL (assuming the data frames are in a database).

Here is an MWE: I have two datasets. One is firm-year data (fdata); the second is survey data collected once every five years (sdata). For every year in fdata that falls between two survey years, I want to join the corresponding survey-year data.
library(dplyr)

# firm-year data: one row per firm id and fiscal year
id <- c(1,1,1,1,
        2,2,2,2,2,2,
        3,3,3,3,3,3,
        5,5,5,5,
        8,8,8,8,
        13,13,13)
fyear <- c(1998,1999,2000,2001,1998,1999,2000,2001,2002,2003,
           1998,1999,2000,2001,2002,2003,1998,1999,2000,2001,
           1998,1999,2000,2001,1998,1999,2000)
# survey data: one row per five-year period [byear, eyear)
byear <- c(1990,1995,2000,2005)
eyear <- c(1995,2000,2005,2010)
val <- c(3,1,5,6)
sdata <- tbl_df(data.frame(byear, eyear, val))
fdata <- tbl_df(data.frame(id, fyear))
# attempted join; this `by` syntax is what the question is about
test1 <- left_join(fdata, sdata, by = c("fyear" >= "byear","fyear" < "eyear"))
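Or can left_join handle this condition, and my syntax is just missing something?

One option is to join the matching rows as a list column and then unnest that column: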
# evaluate each row individually
fdata %>% rowwise() %>%
# insert list column of single row of sdata based on conditions
mutate(s = list(sdata %>% filter(fyear >= byear, fyear < eyear))) %>%
# unnest list column
tidyr::unnest(s)
# Source: local data frame [27 x 5]
#
# id fyear byear eyear val
# (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 1 1998 1995 2000 1
# 2 1 1999 1995 2000 1
# 3 1 2000 2000 2005 5
# 4 1 2001 2000 2005 5
# 5 2 1998 1995 2000 1
# 6 2 1999 1995 2000 1
# 7 2 2000 2000 2005 5
# 8 2 2001 2000 2005 5
# 9 2 2002 2000 2005 5
# 10 2 2003 2000 2005 5
# .. ... ... ... ... ...
data.table added non-equi joins starting with v1.9.8:
library(data.table) #v>=1.9.8
setDT(sdata); setDT(fdata) # converting to data.table in place
fdata[sdata, on = .(fyear >= byear, fyear < eyear), nomatch = 0,
.(id, x.fyear, byear, eyear, val)]
# id x.fyear byear eyear val
# 1: 1 1998 1995 2000 1
# 2: 2 1998 1995 2000 1
# 3: 3 1998 1995 2000 1
# 4: 5 1998 1995 2000 1
# 5: 8 1998 1995 2000 1
# 6: 13 1998 1995 2000 1
# 7: 1 1999 1995 2000 1
# 8: 2 1999 1995 2000 1
# 9: 3 1999 1995 2000 1
#10: 5 1999 1995 2000 1
#11: 8 1999 1995 2000 1
#12: 13 1999 1995 2000 1
#13: 1 2000 2000 2005 5
#14: 2 2000 2000 2005 5
#15: 3 2000 2000 2005 5
#16: 5 2000 2000 2005 5
#17: 8 2000 2000 2005 5
#18: 13 2000 2000 2005 5
#19: 1 2001 2000 2005 5
#20: 2 2001 2000 2005 5
#21: 3 2001 2000 2005 5
#22: 5 2001 2000 2005 5
#23: 8 2001 2000 2005 5
#24: 2 2002 2000 2005 5
#25: 3 2002 2000 2005 5
#26: 2 2003 2000 2005 5
#27: 3 2003 2000 2005 5
# id x.fyear byear eyear val
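You can also use foverlaps in 1.9.6, with a little more effort.

Use filter. (Note, however, that this answer does not produce a correct LEFT JOIN; for the MWE it still gives the right result, because an INNER JOIN happens to return the same rows.)

The dplyr package is unhappy if asked to merge two tables that have no column to merge on, so below I create a dummy variable in both tables, join on it, filter, and then drop dummy: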
fdata %>%
mutate(dummy=TRUE) %>%
left_join(sdata %>% mutate(dummy=TRUE)) %>%
filter(fyear >= byear, fyear < eyear) %>%
select(-dummy)
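The caveat about LEFT JOIN semantics can be made concrete with a minimal base R sketch. The inputs are a cut-down version of the MWE; the fyear == 2011 row and the helper column `row` are assumptions added for illustration, and `merge(..., by = NULL)` stands in for the dummy-variable join. Filtering after the cross join drops the unmatched row, so it has to be added back explicitly:

```r
# Small illustrative inputs; the fyear == 2011 row is an added test case.
fdata <- data.frame(id = c(1, 1, 1), fyear = c(1998, 2001, 2011))
sdata <- data.frame(byear = c(1990, 1995, 2000, 2005),
                    eyear = c(1995, 2000, 2005, 2010),
                    val   = c(3, 1, 5, 6))
fdata$row <- seq_len(nrow(fdata))   # hypothetical helper key to track fdata rows

# Cross join (by = NULL gives the Cartesian product), then filter.
# This behaves like an INNER JOIN on the range condition.
crossed <- merge(fdata, sdata, by = NULL)
inner   <- subset(crossed, fyear >= byear & fyear < eyear)

# Rows of fdata with no matching survey period were filtered out above;
# add them back with NA in the sdata columns to get left-join semantics.
unmatched <- fdata[!(fdata$row %in% inner$row), , drop = FALSE]
for (col in names(sdata)) unmatched[[col]] <- rep(NA_real_, nrow(unmatched))
left <- rbind(inner, unmatched)
left <- left[order(left$row), ]

nrow(inner)  # 2: the 2011 row is missing
nrow(left)   # 3: the 2011 row is back, with NA for byear, eyear and val
```

On the original MWE every fyear falls inside some survey period, which is why the filter-based answer and a true left join agree there.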
Writing the join directly in SQL is cleaner and gives exactly the same result:
> tbl(pg, sql("
+ SELECT *
+ FROM fdata
+ LEFT JOIN sdata
+ ON fyear >= byear AND fyear < eyear")) %>%
+ explain()
<SQL>
SELECT "id", "fyear", "byear", "eyear", "val"
FROM (
SELECT *
FROM fdata
LEFT JOIN sdata
ON fyear >= byear AND fyear < eyear) AS "zzz140"
<PLAN>
Nested Loop Left Join (cost=0.00..50886.88 rows=322722 width=40)
Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
-> Seq Scan on fdata (cost=0.00..28.50 rows=1850 width=16)
-> Materialize (cost=0.00..33.55 rows=1570 width=24)
-> Seq Scan on sdata (cost=0.00..25.70 rows=1570 width=24)
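This looks like the kind of task the fuzzyjoin package handles. The package's functions look and work much like the dplyr join functions.

In this case, one of the fuzzy_*_join functions will work for you. The main difference between dplyr::left_join and fuzzyjoin::fuzzy_left_join is that you can pass a list of functions to use in the matching process through the match_fun argument. Note that the by argument is written the same way as in left_join.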
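Here is an example. The functions I use for matching are >= and <, for the fyear-to-byear and fyear-to-eyear comparisons respectively, i.e. match_fun = list(`>=`, `<`).

Comments:

Like my answer, this does not produce a valid LEFT JOIN. Augment the left data frame with an observation with fyear == 2011, then filter the query result on fyear == 2011: nothing is there. In SQL, a LEFT JOIN on fyear >= byear AND fyear < eyear does handle this case.

setDF can be used afterwards if someone wants to turn their datasets back into plain data.frames.

@eddi After the join, when selecting columns, is there a data.table equivalent of (i.*, x.fyear), i.e. all columns from table i but only fyear from table x?

Thanks. This solution is cleaner and faster than tidyr/dplyr, and it still works when more conditions are added.

The dummy-variable pipeline with explain() appended, run against PostgreSQL:
> fdata %>%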
+ mutate(dummy=TRUE) %>%
+ left_join(sdata %>% mutate(dummy=TRUE)) %>%
+ filter(fyear >= byear, fyear < eyear) %>%
+ select(-dummy) %>%
+ explain()
Joining by: "dummy"
<SQL>
SELECT "id" AS "id", "fyear" AS "fyear", "byear" AS "byear", "eyear" AS "eyear", "val" AS "val"
FROM (SELECT * FROM (SELECT "id", "fyear", TRUE AS "dummy"
FROM "fdata") AS "zzz136"
LEFT JOIN
(SELECT "byear", "eyear", "val", TRUE AS "dummy"
FROM "sdata") AS "zzz137"
USING ("dummy")) AS "zzz138"
WHERE "fyear" >= "byear" AND "fyear" < "eyear"
<PLAN>
Nested Loop (cost=0.00..50886.88 rows=322722 width=40)
Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
-> Seq Scan on fdata (cost=0.00..28.50 rows=1850 width=16)
-> Materialize (cost=0.00..33.55 rows=1570 width=24)
-> Seq Scan on sdata (cost=0.00..25.70 rows=1570 width=24)
The fuzzyjoin example:
library(fuzzyjoin)
fuzzy_left_join(fdata, sdata,
by = c("fyear" = "byear", "fyear" = "eyear"),
match_fun = list(`>=`, `<`))
Source: local data frame [27 x 5]
id fyear byear eyear val
(dbl) (dbl) (dbl) (dbl) (dbl)
1 1 1998 1995 2000 1
2 1 1999 1995 2000 1
3 1 2000 2000 2005 5
4 1 2001 2000 2005 5
5 2 1998 1995 2000 1
6 2 1999 1995 2000 1
7 2 2000 2000 2005 5
8 2 2001 2000 2005 5
9 2 2002 2000 2005 5
10 2 2003 2000 2005 5
.. ... ... ... ... ...
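
For comparison, dplyr itself accepts inequality conditions in later releases: as far as I know, version 1.1.0 and newer provide join_by() for this. The sketch below assumes such a version is installed; the inputs are a cut-down version of the MWE, and the fyear == 2011 row is an added assumption used to show that unmatched rows are kept.

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where join_by() is available

fdata <- data.frame(id = c(1, 1, 1, 1), fyear = c(1998, 1999, 2000, 2011))
sdata <- data.frame(byear = c(1990, 1995, 2000, 2005),
                    eyear = c(1995, 2000, 2005, 2010),
                    val   = c(3, 1, 5, 6))

# Inequality conditions are written as expressions inside join_by():
# the left-hand names refer to fdata, the right-hand names to sdata.
res <- left_join(fdata, sdata, join_by(fyear >= byear, fyear < eyear))

# Every row of fdata is kept; fyear 2011 falls outside all survey
# periods, so its byear, eyear and val are NA.
print(res)
```

Unlike the filter-after-join approach, this keeps left-join semantics without any extra bookkeeping.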