Hive: multiple joins with the same table
In the query below, I join table T1 to several other tables on the same key. In this case, do I need to specify the conditions
AND a.ds = '2014-12-10'
AND a.org_id IS NULL
in every join? What is the rationale for not doing so?
INSERT OVERWRITE TABLE tab1
PARTITION(ds='2014-12-10')
SELECT
    a.var1
  , b.var2
  , c.var3
  , d.var4
FROM T1 a
LEFT OUTER JOIN T2 b
    ON a.var1 = b.var1
    AND a.ds = '2014-12-10'
    AND a.org_id IS NULL
LEFT OUTER JOIN T3 c
    ON a.var1 = c.var1
    AND c.ds = '2014-12-10'
    AND a.ds = '2014-12-10'
    AND a.org_id IS NULL
LEFT OUTER JOIN T4 d
    ON a.var1 = d.var1
    AND d.ds = '2014-12-10'
    AND a.ds = '2014-12-10'
    AND a.org_id IS NULL
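No, you don't. You can simply move those conditions into the WHERE clause of the whole query. Check the explain plan; it will be the same as the one you have now. Code example: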
INSERT OVERWRITE TABLE tab1
PARTITION(ds='2014-12-10')
SELECT
    a.var1
  , b.var2
  , c.var3
  , d.var4
FROM T1 a
LEFT OUTER JOIN T2 b
    ON a.var1 = b.var1
LEFT OUTER JOIN T3 c
    ON a.var1 = c.var1
    AND c.ds = '2014-12-10'
LEFT OUTER JOIN T4 d
    ON a.var1 = d.var1
    AND d.ds = '2014-12-10'
WHERE
    a.ds = '2014-12-10'
    AND a.org_id IS NULL
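One way to follow the suggestion to check the plan is to prefix each form of the query with EXPLAIN and compare the two outputs. The sketch below does this for the SELECT part only, reusing the table and column names from the question; the shortened two-table join is an assumption made to keep the example small. Since ON conditions in an outer join only decide which rows match, while WHERE removes rows from the result, it may also be worth comparing the row counts of the two forms on your own data.

```sql
-- Form 1: filters on T1 placed in the ON clause
EXPLAIN
SELECT a.var1, b.var2
FROM T1 a
LEFT OUTER JOIN T2 b
    ON a.var1 = b.var1
    AND a.ds = '2014-12-10'
    AND a.org_id IS NULL;

-- Form 2: the same filters on T1 moved to the WHERE clause
EXPLAIN
SELECT a.var1, b.var2
FROM T1 a
LEFT OUTER JOIN T2 b
    ON a.var1 = b.var1
WHERE a.ds = '2014-12-10'
  AND a.org_id IS NULL;
```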
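Here is what I found: there is a concept called predicate pushdown (I'm not 100% sure, but I believe it is enabled by default in newer versions of Hive). If I set hive.optimize.ppd=true; then I get the same performance in both cases:
Case I: conditions specified in every join
Result:
-- 24254 rows loaded to tab1
-- MapReduce Jobs Launched:
-- Job 0: Map: 16 Reduce: 4 Cumulative CPU: 802.6 sec HDFS Read: 3020743758 HDFS Write: 900057 SUCCESS
-- Job 1: Map: 1 Cumulative CPU: 4.93 sec HDFS Read: 965541 HDFS Write: 898430 SUCCESS
-- Total MapReduce CPU Time Spent: 13 minutes 27 seconds 530 msec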
INSERT OVERWRITE TABLE tab1
blah blah ..
FROM T1 a
LEFT OUTER JOIN T2 b
    ON a.var1 = b.var1
    AND a.ds = '2014-12-10'
    AND a.org_id IS NULL
LEFT OUTER JOIN T3 c
    ON a.var1 = c.var1
    AND c.ds = '2014-12-10'
    AND a.ds = '2014-12-10'
    AND a.org_id IS NULL
LEFT OUTER JOIN T4 d
    ON a.var1 = d.var1
    AND d.ds = '2014-12-10'
    AND a.ds = '2014-12-10'
    AND a.org_id IS NULL
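Case II: conditions specified only in the first join
Result:
-- 24254
-- Job 0: Map: 16 Reduce: 4 Cumulative CPU: 803.35 sec HDFS Read: 3020743758 HDFS Write: 900057 SUCCESS
-- Job 1: Map: 1 Cumulative CPU: 3.75 sec HDFS Read: 965541 HDFS Write: 898429 SUCCESS
-- Total MapReduce CPU Time Spent: 13 minutes 27 seconds 100 msec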
INSERT OVERWRITE TABLE tab1
blah blah ..
FROM T1 a
LEFT OUTER JOIN T2 b
    ON a.var1 = b.var1
    AND a.ds = '2014-12-10'
    AND a.org_id IS NULL
LEFT OUTER JOIN T3 c
    ON a.var1 = c.var1
    AND c.ds = '2014-12-10'
LEFT OUTER JOIN T4 d
    ON a.var1 = d.var1
    AND d.ds = '2014-12-10'
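For reference, the predicate pushdown setting mentioned above can be inspected and changed per session with SET. This is a minimal sketch, not a tuning recommendation; the shortened query is an assumption for illustration, and the plan you see will depend on your Hive version.

```sql
-- Print the current value of the setting for this session
SET hive.optimize.ppd;

-- Turn predicate pushdown on for this session
SET hive.optimize.ppd=true;

-- Inspect where the filters on T1 end up in the plan
EXPLAIN
SELECT a.var1, c.var3
FROM T1 a
LEFT OUTER JOIN T3 c
    ON a.var1 = c.var1
    AND c.ds = '2014-12-10'
WHERE a.ds = '2014-12-10'
  AND a.org_id IS NULL;
```

Running the same EXPLAIN with the setting switched to false should show whether the filter placement in the plan changes on your cluster.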
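WHERE clause?? Do you mean IN?
I added a code example above. Here is a link for reference:
I have read that using WHERE is not as efficient as filtering in the JOIN in Hive. Also, I recently learned that it works if I specify the conditions only in the first join. The result is the same as with a WHERE clause.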