Hive 如何仅在配置单元中不存在分区时插入覆盖分区?
如何仅在配置单元中不存在分区时插入覆盖分区Hive 如何仅在配置单元中不存在分区时插入覆盖分区?,hive,hiveql,hive-partitions,Hive,Hiveql,Hive Partitions,如何仅在配置单元中不存在分区时插入覆盖分区 就像标题一样。我正在做一些总是需要重写蜂巢表的事情。我有多个分区的表,当我在更改后重新运行代码时,我只想插入新分区而不更改现有分区。您可以与现有分区列表合并,并在其为空条件时添加(不仅合并)。此外,您还可以使用NOT EXISTS(它将生成与配置单元中的left join相同的计划),如下所示: insert overwrite table target_table partition (partition_key) select col
就像标题一样。我正在做一些总是需要重写蜂巢表的事情。我有多个分区的表,当我在更改后重新运行代码时,我只想插入新分区而不更改现有分区。您可以与现有分区列表合并,并在其为空条件时添加(不仅合并)。此外,您还可以使用NOT EXISTS(它将生成与配置单元中的left join相同的计划),如下所示:
insert overwrite table target_table partition (partition_key)
select col1, ... coln, s.partition_key
from source s
left join (select distinct partition_key --existing partitions
from target_table
) t on s.partition_key=t.partition_key
where t.partition_key is NULL; --no partitions exists in the target
一种选择是将源数据集与目标表中不同的分区列连接起来(将分区列作为键左连接),并过滤掉公共分区。你知道我的意思;您的配置单元查询应如下所示:
insert overwrite table target_table partition (partition_column1, partition_column2, ..., partition_columnN)
select
src.column1,
src.column2,
....,
src.columnN,
src.partition_column1,
src.partition_column2,
....,
src.partition_columnN
from
source src
left join
(
select distinct
partition_column1,
partition_column2,
....,
partition_columnN
from
target
)
tgt
on src.partition_column1 = tgt.partition_column1
and src.partition_column1 = tgt.partition_column1
...
src.partition_columnN = tgt.partition_columnN
where
tgt.partition_column1 is null
or tgt.partition_column2 is null
...
tgt.partition_columnN is null;
下面给出了该逻辑的简单演示:
让我们创建两个名为orders和orders\u source的表。order表将是目标表,orders\u source是源表。为了简单起见,我对这两个表使用了类似的模式
CREATE TABLE `orders`(
`id` int,
`customer_id` int,
`shipper_id` int)
PARTITIONED BY (
`state` string,
`order_date` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
'orc.bloom.filter.columns'='id,customer_id',
'orc.compress'='SNAPPY',
'orc.compress.size'='262144',
'orc.create.index'='true',
'orc.row.index.stride'='3000',
'orc.stripe.size'='268435456');
CREATE TABLE `orders_source`(
`id` int,
`customer_id` int,
`shipper_id` int)
PARTITIONED BY (
`state` string,
`order_date` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
'orc.bloom.filter.columns'='id,customer_id',
'orc.compress'='SNAPPY',
'orc.compress.size'='262144',
'orc.create.index'='true',
'orc.row.index.stride'='3000',
'orc.stripe.size'='268435456');
接下来,插入一些用于测试逻辑的示例记录:
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
insert overwrite table orders partition (state, order_date)
select
orde.id,
orde.customer_id,
orde.shipper_id,
orde.state,
orde.order_date
from
(
select
10240 as id,
20480 as customer_id,
30720 as shipper_id,
'CA' as state,
'2019-09-01' as order_date
union all
select
10241 as id,
20481 as customer_id,
30721 as shipper_id,
'GA' as state,
'2019-09-01' as order_date
)
orde;
insert overwrite table orders_source partition (state, order_date)
select
orso.id,
orso.customer_id,
orso.shipper_id,
orso.state,
orso.order_date
from
(
select
10240 as id,
20480 as customer_id,
30720 as shipper_id,
'CA' as state,
'2019-09-01' as order_date
union all
select
10242 as id,
20482 as customer_id,
30722 as shipper_id,
'CA' as state,
'2019-09-02' as order_date
union all
select
10243 as id,
20483 as customer_id,
30723 as shipper_id,
'FL' as state,
'2019-09-02' as order_date
union all
select
10244 as id,
20484 as customer_id,
30724 as shipper_id,
'TX' as state,
'2019-09-02' as order_date
)
orso;
现在,在运行实际业务逻辑之前,让我们检查插入到两个表中的数据:
hive (default)> select * from orders_source;
OK
orders_source.id orders_source.customer_id orders_source.shipper_id orders_source.state orders_source.order_date
10240 20480 30720 CA 2019-09-01
10242 20482 30722 CA 2019-09-02
10243 20483 30723 FL 2019-09-02
10244 20484 30724 TX 2019-09-02
Time taken: 0.085 seconds, Fetched: 4 row(s)
hive (default)> select * from orders;
OK
orders.id orders.customer_id orders.shipper_id orders.state orders.order_date
10240 20480 30720 CA 2019-09-01
10241 20481 30721 GA 2019-09-01
Time taken: 0.073 seconds, Fetched: 2 row(s)
接下来,执行我们的逻辑,从源表中选择记录并插入到目标表中。您可以运行以下查询:
hive (default)> select
orso.id,
orso.customer_id,
orso.shipper_id,
orso.state,
orso.order_date
from
orders_source orso
left join
(
select distinct
state,
order_date
from
orders
)
orde
on orso.state = orde.state
and orso.order_date = orde.order_date
where
orde.state is null
or orde.order_date is null;
OK
orso.id orso.customer_id orso.shipper_id orso.state orso.order_date
10243 20483 30723 FL 2019-09-02
10244 20484 30724 TX 2019-09-02
10242 20482 30722 CA 2019-09-02
Time taken: 11.113 seconds, Fetched: 3 row(s)
insert overwrite table orders partition (state, order_date)
select
orso.id,
orso.customer_id,
orso.shipper_id,
orso.state,
orso.order_date
from
orders_source orso
left join
(
select distinct
state,
order_date
from
orders
)
orde
on orso.state = orde.state
and orso.order_date = orde.order_date
where
orde.state is null
or orde.order_date is null;
您可以看到上述结果
最后,通过发出以下查询将记录插入目标表:
hive (default)> select
orso.id,
orso.customer_id,
orso.shipper_id,
orso.state,
orso.order_date
from
orders_source orso
left join
(
select distinct
state,
order_date
from
orders
)
orde
on orso.state = orde.state
and orso.order_date = orde.order_date
where
orde.state is null
or orde.order_date is null;
OK
orso.id orso.customer_id orso.shipper_id orso.state orso.order_date
10243 20483 30723 FL 2019-09-02
10244 20484 30724 TX 2019-09-02
10242 20482 30722 CA 2019-09-02
Time taken: 11.113 seconds, Fetched: 3 row(s)
insert overwrite table orders partition (state, order_date)
select
orso.id,
orso.customer_id,
orso.shipper_id,
orso.state,
orso.order_date
from
orders_source orso
left join
(
select distinct
state,
order_date
from
orders
)
orde
on orso.state = orde.state
and orso.order_date = orde.order_date
where
orde.state is null
or orde.order_date is null;
现在,让我们在insert操作之后验证目标表中的数据
hive (default)> select * from orders;
OK
orders.id orders.customer_id orders.shipper_id orders.state orders.order_date
10240 20480 30720 CA 2019-09-01
10242 20482 30722 CA 2019-09-02
10243 20483 30723 FL 2019-09-02
10241 20481 30721 GA 2019-09-01
10244 20484 30724 TX 2019-09-02
Time taken: 0.074 seconds, Fetched: 5 row(s)
就这样。你都准备好了