Performance 使用配置单元分区表优化连接性能

Performance 使用配置单元分区表优化连接性能,performance,hive,query-optimization,partitioning,Performance,Hive,Query Optimization,Partitioning,我有一个Hive orc test_dev_db.TransactionUpdateTable表,其中包含一些示例数据,其中包含需要更新到主表(test_dev_db.TransactionMainHistoryTable)的增量数据,主表在Country、Tran_date列上进行分区 配置单元增量加载表模式:它包含需要合并的19行 CREATE TABLE IF NOT EXISTS test_dev_db.TransactionUpdateTable ( Transaction_date

我有一个Hive orc test_dev_db.TransactionUpdateTable表,其中包含一些示例数据,其中包含需要更新到主表(test_dev_db.TransactionMainHistoryTable)的增量数据,主表在Country、Tran_date列上进行分区

配置单元增量加载表模式:它包含需要合并的19行

CREATE TABLE IF NOT EXISTS test_dev_db.TransactionUpdateTable 
(
Transaction_date timestamp,
Product       string,
Price         int,
Payment_Type  string,
Name          string, 
City          string,
State         string,
Country       string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS orc
;
配置单元主表架构:行总数为77

CREATE TABLE IF NOT EXISTS test_dev_db.TransactionMainHistoryTable
(
Transaction_date timestamp,
Product       string,
Price         int,
Payment_Type  string,
Name          string,
City          string,
State         string
)
PARTITIONED BY (Country string,Tran_date string) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS orc
;
我在下面运行查询,将增量数据与主表合并

SELECT
  case when i.transaction_date is not null then cast(substring(current_timestamp(),0,19) as timestamp)  
  else t.transaction_date   end as transaction_date,
  t.product,
  case when i.price is not null then i.price else t.price end as price,
  t.payment_type,
  t.name,
  t.city,
  t.state,
  t.country,
  case when i.transaction_date is not null then substring(current_timestamp(),0,10) 
  else t.tran_date end as tran_date
  from
test_dev_db.TransactionMainHistoryTable t
full join test_dev_db.TransactionUpdateTable i on (t.Name=i.Name)
;
/hdfs/path/database/test_dev_db.db/transactionmainhistorytable/country=Australia/tran_date=2009-03-01
/hdfs/path/database/test_dev_db.db/transactionmainhistorytable/country=Australia/tran_date=2009-05-01
并在下面运行查询,以筛选出需要合并的特定分区,从而消除对未更新分区的重写

SELECT
  case when i.transaction_date is not null then cast(substring(current_timestamp(),0,19) as timestamp)  
  else t.transaction_date   end as transaction_date,
  t.product,
  case when i.price is not null then i.price else t.price end as price,
  t.payment_type,
  t.name,
  t.city,
  t.state,
  t.country,
  case when i.transaction_date is not null then substring(current_timestamp(),0,10) else t.tran_date end as tran_date
  from
(SELECT 
  *
  FROM 
test_dev_db.TransactionMainHistoryTable
where Tran_date in
(select distinct  from_unixtime(to_unix_timestamp (Transaction_date,'yyyy-MM-dd HH:mm'),'yyyy-MM-dd') from test_dev_db.TransactionUpdateTable
))t
full join test_dev_db.TransactionUpdateTable i on (t.Name=i.Name)
;
在这两种情况下,只需要更新事务处理日期、价格和分区列事务处理日期。这两个查询都运行良好,尽管横向查询需要更长的执行时间

分区表的执行计划如下:

 Stage: Stage-5
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: transactionmainhistorytable
            filterExpr: tran_date is not null (type: boolean)
            Statistics: Num rows: 77 Data size: 39151 Basic stats: COMPLETE Column stats: COMPLETE
            Map Join Operator
              condition map:
                   Left Semi Join 0 to 1
              keys:
                0 tran_date (type: string)
                1 _col0 (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
我在第二个问题上做错了什么吗?我是否需要使用两个分区列来进行更好的修剪。
非常感谢任何帮助或建议。

也许这不是一个完整的答案,但我希望这些想法会有用

where tran_date IN (select ... )
实际上和

LEFT SEMI JOIN (SELECT ...)
这反映在计划中:

Map Join Operator
              condition map:
                   Left Semi Join 0 to 1
              keys:
                0 tran_date (type: string)
                1 _col0 (type: string) 
它作为映射连接执行。首先选择子查询数据集,然后将其放置在分布式缓存中,加载到内存中以用于映射联接。所有这些步骤:选择、加载到内存中、映射联接都比读取和覆盖所有表慢,因为它太小且分区过多:统计数据显示Num rows:77 Data size:39151-太小而无法被两列分区,甚至太小而根本无法分区。尝试使用更大的表并使用EXPLAIN EXTENDED检查真正被扫描的内容

此外,请将其替换为:

from_unixtime(to_unix_timestamp (Transaction_date,'yyyy-MM-dd HH:mm'),'yyyy-MM-dd')
使用
substr(交易日期,0,10)
date(交易日期)

子字符串(当前时间戳,0,10)
当前日期
只是为了简化代码

如果要在计划中显示分区筛选器,请尝试将传递的分区筛选器替换为分区列表,您可以在单独的会话中选择该列表,并使用shell将分区列表传递到where子句中,请参见以下答案: