How to improve performance in Hive

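I am running the query below in Hive on about 2 million records. Is there any way to improve its performance? The source Hive table is partitioned on the created_date column.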

 select t.id,
    case when t.amt_1_rank < 0.3*f.amt_1_count then t.amt_1 else null end as amt_1,
    case when t.amt_2_rank < 0.3*f.amt_2_count then t.amt_2 else null end as amt_2,
..
..
..  -- Likewise for about 30 columns, e.g. amt_3, amt_4, ...
     from (
    select a.id,
    a.amt_1,
    row_number() over (ORDER BY cast(a.amt_1 AS DECIMAL(8,7)) DESC) AS amt_1_rank,
    a.amt_2,
    row_number() over (ORDER BY cast(a.amt_2 AS DECIMAL(8,7)) DESC) AS amt_2_rank
    from source_table a WHERE created_date='2017-10-15' )t
    join 
    ( 
    SELECT count(case when amt_1='.' then null else 1 end) AS amt_1_count,
    count(case when amt_2='.' then null else 1 end) AS amt_2_count,
..
..

    FROM   source_table
    WHERE  created_date='2017-10-15' 
    ) f

This can be done without the join:

select t.id,
    case when t.amt_1_rank < 0.3*t.amt_1_count then t.amt_1 else null end as amt_1,
    case when t.amt_2_rank < 0.3*t.amt_2_count then t.amt_2 else null end as amt_2,
..
..
..  -- Likewise for about 30 columns, e.g. amt_3, amt_4, ...
     from (
    select a.id,
           a.amt_1,
           row_number() over (ORDER BY cast(a.amt_1 AS DECIMAL(8,7)) DESC) AS amt_1_rank,
           a.amt_2,
           row_number() over (ORDER BY cast(a.amt_2 AS DECIMAL(8,7)) DESC) AS amt_2_rank,
           count(amt_1_flag) over()                                        AS amt_1_count,
           count(amt_2_flag) over()                                        AS amt_2_count
      from 
           (select a.*,
                   case when amt_1='.' then null else 1 end as amt_1_flag,
                   case when amt_2='.' then null else 1 end as amt_2_flag
              from source_table a WHERE created_date='2017-10-15' 
            )a  

           )t

Do you think this will work for about 20-30 such columns?

@user2672739 Why not? Please explain your point; I don't fully understand it. I am only suggesting removing the join and an unnecessary table scan.

The problem here is that the row_number() over (order by ...) clause on each column takes time, because there are about 30 such columns. Is there any other way?

@user2672739 No, unfortunately, not that I know of.

Apply a compression codec to the table; the query will then take less time.
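The compression suggestion could look roughly like the sketch below. It is only an illustration under assumptions: source_table_orc is a hypothetical table name, the column types are guessed (the amt columns are compared with '.', so they are taken to be strings), and Snappy is just one possible codec. The settings and the orc.compress table property are standard Hive/Hadoop options, but whether they help here depends on the cluster and on how the data is stored today.

```sql
-- Compress the intermediate data passed between job stages.
SET hive.exec.compress.intermediate=true;
-- Compress the final output of the job.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Hypothetical compressed copy of the source table (name and column types are assumptions).
CREATE TABLE source_table_orc (
  id    STRING,
  amt_1 STRING,
  amt_2 STRING
  -- remaining amt_N columns would follow the same pattern
)
PARTITIONED BY (created_date STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

-- Load one partition from the original table.
INSERT OVERWRITE TABLE source_table_orc PARTITION (created_date='2017-10-15')
SELECT id, amt_1, amt_2
FROM source_table
WHERE created_date='2017-10-15';
```

Compression mainly reduces I/O. The row_number() sorts in the query have no PARTITION BY, so each of them probably still runs through a single reducer, and the gain from compression alone may be limited.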