Hive: multiple joins with the same table


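In the query below, I join table T1 to several tables on the same key. In this situation, do I need to specify the conditions

 AND a.ds = '2014-12-10'
 AND a.org_id IS NULL
in every join? If not, what is the reasoning for leaving them out?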

INSERT OVERWRITE TABLE tab1
        PARTITION(ds='2014-12-10')
    SELECT
        a.var1
        , b.var2
        , c.var3
        , d.var4

    FROM T1 a
    LEFT OUTER JOIN T2 b
        ON a.var1 = b.var1
        AND a.ds = '2014-12-10'
        AND a.org_id IS NULL
    LEFT OUTER JOIN T3 c
        ON a.var1 = c.var1
        AND c.ds = '2014-12-10'
        AND a.ds = '2014-12-10'
        AND a.org_id IS NULL
    LEFT OUTER JOIN T4 d
        ON a.var1 = d.var1
        AND d.ds = '2014-12-10'
        AND a.ds = '2014-12-10'
        AND a.org_id IS NULL

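No, you don't. You can simply move these conditions into the WHERE clause of the whole query. Check the EXPLAIN plan; it should be the same as the one you have now.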

Code example:

INSERT OVERWRITE TABLE tab1
        PARTITION(ds='2014-12-10')
    SELECT
        a.var1
        , b.var2
        , c.var3
        , d.var4
    FROM T1 a
    LEFT OUTER JOIN T2 b
        ON a.var1 = b.var1
    LEFT OUTER JOIN T3 c
        ON a.var1 = c.var1
        AND c.ds = '2014-12-10'
    LEFT OUTER JOIN T4 d
        ON a.var1 = d.var1
        AND d.ds = '2014-12-10'
    WHERE
        a.ds = '2014-12-10'
        AND a.org_id IS NULL
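
One way to check the plan claim is to prefix each variant with EXPLAIN and compare the two outputs. The following is only a sketch: it shows just the first join, drops the INSERT part, and assumes that T2 is aliased as b and joined on var1.

```sql
-- Variant A: filters on T1 placed in the ON clause of the join
EXPLAIN
SELECT a.var1, b.var2
FROM T1 a
LEFT OUTER JOIN T2 b
    ON a.var1 = b.var1
    AND a.ds = '2014-12-10'
    AND a.org_id IS NULL;

-- Variant B: the same filters placed in the WHERE clause
EXPLAIN
SELECT a.var1, b.var2
FROM T1 a
LEFT OUTER JOIN T2 b
    ON a.var1 = b.var1
WHERE a.ds = '2014-12-10'
    AND a.org_id IS NULL;
```

In each plan, look at where the Filter Operator for a.ds and a.org_id appears relative to the TableScan of T1 and the Join Operator. If the two plans differ, the queries are not being executed the same way.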

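Here is what I found: there is a concept called predicate pushdown (I'm not 100% sure, but I believe it is enabled by default in newer versions of Hive). If I set hive.optimize.ppd=true; I get the same performance in both cases: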

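Case I: conditions specified in every join. Result:

-- 24254 Rows loaded to tab1

-- MapReduce Jobs Launched:

-- Job 0: Map: 16 Reduce: 4 Cumulative CPU: 802.6 sec HDFS Read: 3020743758 HDFS Write: 900057 SUCCESS

-- Job 1: Map: 1 Cumulative CPU: 4.93 sec HDFS Read: 965541 HDFS Write: 898430 SUCCESS

-- Total MapReduce CPU Time Spent: 13 minutes 27 seconds 530 msec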

INSERT OVERWRITE TABLE tab1
    blah blah ..
FROM T1 a
LEFT OUTER JOIN T2 b
    ON a.var1 = b.var1
    AND a.ds = '2014-12-10'
    AND a.org_id IS NULL
LEFT OUTER JOIN T3 c
    ON a.var1 = c.var1
    AND c.ds = '2014-12-10'
    AND a.ds = '2014-12-10'
    AND a.org_id IS NULL
LEFT OUTER JOIN T4 d
    ON a.var1 = d.var1
    AND d.ds = '2014-12-10'
    AND a.ds = '2014-12-10'
    AND a.org_id IS NULL

Case II: conditions specified only in the first join

Result:

-- 24254

-- Job 0: Map: 16 Reduce: 4 Cumulative CPU: 803.35 sec HDFS Read: 3020743758 HDFS Write: 900057 SUCCESS

-- Job 1: Map: 1 Cumulative CPU: 3.75 sec HDFS Read: 965541 HDFS Write: 898429 SUCCESS

-- Total MapReduce CPU Time Spent: 13 minutes 27 seconds 100 msec

INSERT OVERWRITE TABLE tab1
    blah blah ..
FROM T1 a
LEFT OUTER JOIN T2 b
    ON a.var1 = b.var1
    AND a.ds = '2014-12-10'
    AND a.org_id IS NULL
LEFT OUTER JOIN T3 c
    ON a.var1 = c.var1
    AND c.ds = '2014-12-10'
LEFT OUTER JOIN T4 d
    ON a.var1 = d.var1
    AND d.ds = '2014-12-10'
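
For reference, the predicate pushdown setting mentioned above can be inspected and changed per session from the Hive CLI. This is a minimal sketch. The Hive configuration documentation lists true as the default for this property, but it is worth confirming on your own version.

```sql
-- Print the current value of the predicate pushdown setting
SET hive.optimize.ppd;

-- Enable predicate pushdown for the current session only
SET hive.optimize.ppd=true;
```

Running the same query once with the setting on and once with it off, and comparing the HDFS Read figures in the job summary, gives a rough indication of whether the filters are being applied before the join.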

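WHERE clause? Do you mean IN?

I added a code example above. Here is a link for reference:

I have read that using WHERE is not as efficient as filtering in the join in Hive. I also learned recently that it works if I specify the conditions only in the first join. The result will be the same as with a WHERE clause.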
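
One point worth verifying before relying on the ON and WHERE variants being interchangeable: under standard outer-join semantics, a condition on the left table inside the ON clause of a LEFT OUTER JOIN only decides which rows find a match. Unmatched left rows are still returned, with NULLs on the right side. The same condition in WHERE removes those rows. The sketch below uses hypothetical inline tables (left_t, right_t) rather than the tables from the question, and assumes a Hive version that supports CTEs and SELECT without FROM (0.13 or later).

```sql
-- Filter on the left table inside ON: both left rows are returned,
-- and the row tagged 'drop' gets NULL for r.val
WITH left_t AS (
    SELECT 1 AS k, 'keep' AS tag
    UNION ALL
    SELECT 2 AS k, 'drop' AS tag
),
right_t AS (
    SELECT 1 AS k, 'r1' AS val
    UNION ALL
    SELECT 2 AS k, 'r2' AS val
)
SELECT l.k, l.tag, r.val
FROM left_t l
LEFT OUTER JOIN right_t r
    ON l.k = r.k
    AND l.tag = 'keep';

-- Same filter in WHERE: only the row tagged 'keep' is returned
WITH left_t AS (
    SELECT 1 AS k, 'keep' AS tag
    UNION ALL
    SELECT 2 AS k, 'drop' AS tag
),
right_t AS (
    SELECT 1 AS k, 'r1' AS val
    UNION ALL
    SELECT 2 AS k, 'r2' AS val
)
SELECT l.k, l.tag, r.val
FROM left_t l
LEFT OUTER JOIN right_t r
    ON l.k = r.k
WHERE l.tag = 'keep';
```

If T1 holds rows outside ds = '2014-12-10' or rows with a non-NULL org_id, the two forms of the original query may therefore load different row counts into tab1, so comparing the counts on real data is a sensible check.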