Hive: How to insert overwrite partitions only if the partitions do not already exist?

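Just as the title says. I am working on something that always has to rewrite a Hive table. The table has multiple partitions, and when I rerun my code after a change, I only want to insert the new partitions without changing the existing ones.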

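You can left join the source with the list of existing partitions and keep only the rows where the joined side is NULL (that is, the rows that did not join). You can also use NOT EXISTS; in Hive it generates the same plan as the left join. The left join version looks like this: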

insert overwrite table target_table partition (partition_key)
select col1, ... coln, s.partition_key
  from source s
       left join (select distinct partition_key -- existing partitions
                    from target_table
                 ) t on s.partition_key = t.partition_key
 where t.partition_key is null; -- the partition does not exist in the target yet
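
The NOT EXISTS form mentioned above could be sketched as follows. This is a minimal sketch, not a tested statement: target_table, source and partition_key are the placeholder names from the query above, while col1 and col2 are hypothetical stand-ins for the real non-partition columns. The two set statements are only needed if dynamic partitioning is not already enabled in the session.

```sql
-- Enable dynamic partitioning for this session (skip if already set).
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

-- col1 and col2 are hypothetical column names; list the real columns instead.
insert overwrite table target_table partition (partition_key)
select s.col1,
       s.col2,
       s.partition_key              -- the partition column must come last
  from source s
 where not exists (select 1         -- keep only rows whose partition is absent
                     from target_table t
                    where t.partition_key = s.partition_key);
```

The subquery is correlated on the partition key only, so a source row is kept exactly when the target has no row in that partition.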
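
Another option is to left join the source dataset with the distinct partition columns of the target table (using the partition columns as the join keys) and filter out the partitions that both sides have in common. Your Hive query should look something like this: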

insert overwrite table target_table partition (partition_column1, partition_column2, ..., partition_columnN)
select
   src.column1,
   src.column2,
   ....,
   src.columnN,
   src.partition_column1,
   src.partition_column2,
   ....,
   src.partition_columnN
from
   source src 
   left join
      (
         select distinct
            partition_column1,
            partition_column2,
            ....,
            partition_columnN
         from
            target_table
      )
      tgt 
      on src.partition_column1 = tgt.partition_column1 
      and src.partition_column2 = tgt.partition_column2
      ...
      and src.partition_columnN = tgt.partition_columnN 
where
   tgt.partition_column1 is null 
   or tgt.partition_column2 is null
   ...
   or tgt.partition_columnN is null;

A simple demonstration of this logic is given below:

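Let's create two tables named orders and orders_source. The orders table will be the target table and orders_source the source table. For simplicity, I have used the same schema for both tables.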

CREATE TABLE `orders`(
  `id` int, 
  `customer_id` int, 
  `shipper_id` int)
PARTITIONED BY ( 
  `state` string,
  `order_date` date)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
  'orc.bloom.filter.columns'='id,customer_id', 
  'orc.compress'='SNAPPY', 
  'orc.compress.size'='262144', 
  'orc.create.index'='true', 
  'orc.row.index.stride'='3000', 
  'orc.stripe.size'='268435456');

CREATE TABLE `orders_source`(
  `id` int, 
  `customer_id` int, 
  `shipper_id` int)
PARTITIONED BY ( 
  `state` string,
  `order_date` date)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
  'orc.bloom.filter.columns'='id,customer_id', 
  'orc.compress'='SNAPPY', 
  'orc.compress.size'='262144', 
  'orc.create.index'='true', 
  'orc.row.index.stride'='3000', 
  'orc.stripe.size'='268435456');

Next, insert some sample records to test the logic:

set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

insert overwrite table orders partition (state, order_date) 
select
   orde.id,
   orde.customer_id,
   orde.shipper_id,
   orde.state,
   orde.order_date 
from
   (
      select
         10240 as id,
         20480 as customer_id,
         30720 as shipper_id,
         'CA' as state,
         '2019-09-01' as order_date 
      union all
      select
         10241 as id,
         20481 as customer_id,
         30721 as shipper_id,
         'GA' as state,
         '2019-09-01' as order_date
   )
   orde;

insert overwrite table orders_source partition (state, order_date) 
select
   orso.id,
   orso.customer_id,
   orso.shipper_id,
   orso.state,
   orso.order_date 
from
   (
      select
         10240 as id,
         20480 as customer_id,
         30720 as shipper_id,
         'CA' as state,
         '2019-09-01' as order_date 
      union all
      select
         10242 as id,
         20482 as customer_id,
         30722 as shipper_id,
         'CA' as state,
         '2019-09-02' as order_date 
      union all
      select
         10243 as id,
         20483 as customer_id,
         30723 as shipper_id,
         'FL' as state,
         '2019-09-02' as order_date 
      union all
      select
         10244 as id,
         20484 as customer_id,
         30724 as shipper_id,
         'TX' as state,
         '2019-09-02' as order_date
   )
   orso;

Now, before running the actual business logic, let's check the data inserted into the two tables:

hive (default)> select * from orders_source;
OK
orders_source.id    orders_source.customer_id   orders_source.shipper_id    orders_source.state orders_source.order_date
10240   20480   30720   CA  2019-09-01
10242   20482   30722   CA  2019-09-02
10243   20483   30723   FL  2019-09-02
10244   20484   30724   TX  2019-09-02
Time taken: 0.085 seconds, Fetched: 4 row(s)

hive (default)> select * from orders;
OK
orders.id   orders.customer_id  orders.shipper_id   orders.state    orders.order_date
10240   20480   30720   CA  2019-09-01
10241   20481   30721   GA  2019-09-01
Time taken: 0.073 seconds, Fetched: 2 row(s)
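
Next, run the logic that selects the records from the source table that should be inserted into the target table. You can run the following query: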

hive (default)> select
   orso.id,
   orso.customer_id,
   orso.shipper_id,
   orso.state,
   orso.order_date 
from
   orders_source orso 
   left join
      (
         select distinct
            state,
            order_date 
         from
            orders
      )
      orde 
      on orso.state = orde.state 
      and orso.order_date = orde.order_date 
where
   orde.state is null 
   or orde.order_date is null;
OK
orso.id orso.customer_id    orso.shipper_id orso.state  orso.order_date
10243   20483   30723   FL  2019-09-02
10244   20484   30724   TX  2019-09-02
10242   20482   30722   CA  2019-09-02
Time taken: 11.113 seconds, Fetched: 3 row(s)
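
You can see the results above: only the rows whose (state, order_date) partition is missing from orders are returned.

Finally, insert the records into the target table by issuing the following query. Because a dynamic-partition insert overwrite only replaces the partitions that appear in the query result, the existing partitions are left untouched: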

insert overwrite table orders partition (state, order_date)
select
   orso.id,
   orso.customer_id,
   orso.shipper_id,
   orso.state,
   orso.order_date 
from
   orders_source orso 
   left join
      (
         select distinct
            state,
            order_date 
         from
            orders
      )
      orde 
      on orso.state = orde.state 
      and orso.order_date = orde.order_date 
where
   orde.state is null 
   or orde.order_date is null;

Now, let's verify the data in the target table after the insert operation:

hive (default)> select * from orders;
OK
orders.id   orders.customer_id  orders.shipper_id   orders.state    orders.order_date
10240   20480   30720   CA  2019-09-01
10242   20482   30722   CA  2019-09-02
10243   20483   30723   FL  2019-09-02
10241   20481   30721   GA  2019-09-01
10244   20484   30724   TX  2019-09-02
Time taken: 0.074 seconds, Fetched: 5 row(s)
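
As an optional extra check at the partition level rather than the row level, the partition list of the target table can be inspected directly. This is only a sketch that assumes the demo table orders from above; show partitions is a standard Hive statement.

```sql
-- List every partition currently registered for the demo target table.
show partitions orders;

-- Optionally narrow the listing to a single state.
show partitions orders partition (state = 'CA');
```

Running this before and after the insert makes it easy to see which partitions were newly created.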

That's it. You're all set.