Hive 配置单元多级分区并选择WITHERE子句

Hive 配置单元多级分区并选择WITHERE子句,hive,mapreduce,hiveql,hadoop2,Hive,Mapreduce,Hiveql,Hadoop2,我有两个分区的蜂巢表,第一个分区是城市,第二个分区是村庄,所以每个城市分区都会包含所有村庄分区的列表 city1/village1 city1/village2 city1/village3 city2/village5 city2/village6 所以如果我的select语句是select*fromtable where village='village5' 它会在输出结果之前搜索城市1和城市2中的所有分区吗?或者它会看到配置单元元存储文件并只命中village5分区。这取决于配置单元版本

我有两个分区的蜂巢表,第一个分区是城市,第二个分区是村庄,所以每个城市分区都会包含所有村庄分区的列表

city1/village1
city1/village2
city1/village3
city2/village5
city2/village6
所以如果我的select语句是
select*fromtable where village='village5'

它会在输出结果之前搜索城市1和城市2中的所有分区吗?或者它会看到配置单元元存储文件并只命中village5分区。

这取决于配置单元版本的优化程度。在我当前的版本(1.1.0)中,Hive能够指向特定的分区,而无需扫描顶部分区

下面是一个快速演示

create table mydb.partition_test 
(id string)
partitioned by (city string, village string);

INSERT OVERWRITE TABLE mydb.partition_test PARTITION (city,village)
select * from (

select '1', 'city1', 'village1'
union all 
select '1', 'city1', 'village2'
union all 
select '1', 'city1', 'village3'
union all 
select '1', 'city2', 'village5'
union all 
select '1', 'city2', 'village6'
) t;

explain select * from mydb.partition_test where village='village5';

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: partition_test
          filterExpr: (village = 'village5') (type: boolean)
          Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: PARTIAL
          Select Operator
            expressions: id (type: string), city (type: string), 'village5' (type: string)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: PARTIAL
            ListSink

正如您从执行计划中看到的,它能够估计特定分区的记录数,而无需映射操作,并且表扫描指向特定分区

这将取决于您的配置单元版本的优化程度。在我当前的版本(1.1.0)中,Hive能够指向特定的分区,而无需扫描顶部分区

下面是一个快速演示

create table mydb.partition_test 
(id string)
partitioned by (city string, village string);

INSERT OVERWRITE TABLE mydb.partition_test PARTITION (city,village)
select * from (

select '1', 'city1', 'village1'
union all 
select '1', 'city1', 'village2'
union all 
select '1', 'city1', 'village3'
union all 
select '1', 'city2', 'village5'
union all 
select '1', 'city2', 'village6'
) t;

explain select * from mydb.partition_test where village='village5';

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: partition_test
          filterExpr: (village = 'village5') (type: boolean)
          Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: PARTIAL
          Select Operator
            expressions: id (type: string), city (type: string), 'village5' (type: string)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: PARTIAL
            ListSink
正如您从执行计划中看到的,它能够估计特定分区的记录数,而无需映射操作,并且表扫描指向特定分区

谢谢!!所以蜂巢版本谢谢你!!那么蜂巢版本呢