Hive: How to update some columns of a table in Hive using data from another table


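I want to update some columns of one table with data from another table.

The three columns cf_mng, cf_sds and cf_htg in the table cust_tabl contain no data.

I want to fill these three columns of cust_tabl (cf_mng, cf_sds, cf_htg) with the data from the three columns cust_cd_cnt_1, cust_cd_cnt_2 and cust_cd_cnt_3 of the custom_hist table.

This table (custom_hist) contains data from 201505 to 201509: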

CREATE TABLE custom_hist( 
 cust_no varchar(20),    
 cust_cd_cnt_1 float,  
 cust_cd_cnt_2 float,  
 cust_cd_cnt_3 float,  
 cust_dt date,
 cust_name string) 
 PARTITIONED BY (yyyymm int);

This table (cust_tabl) contains data from 201403 to 201606:

CREATE TABLE cust_tabl(
cust_no string,  
cf_mng double,  
cf_sds double,  
cf_htg double,  
cust_loc string,  
cust_region string,  
cust_country string,
cust_reg_id smallint)
PARTITIONED BY (yyyymm int);

Please help me.

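Join the tables on the primary key and overwrite the joined partitions. Check the primary key first: the join cardinality should be 1:1 or 1:0. Otherwise, apply row_number, rank, or an aggregation such as max to the history table so that the join does not multiply rows: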

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;

insert overwrite table cust_tabl partition (yyyymm)
select 
      c.cust_no,
      coalesce(h.cust_cd_cnt_1,c.cf_mng) as cf_mng, --take history column if joined
      coalesce(h.cust_cd_cnt_2,c.cf_sds) as cf_sds, --take original if not joined
      coalesce(h.cust_cd_cnt_3,c.cf_htg) as cf_htg,
      c.cust_loc,  --original columns
      c.cust_region,
      c.cust_country,
      c.cust_reg_id,
      c.yyyymm     --partition is the last
  from cust_tabl c
       left join custom_hist h 
                 --assume this is the primary key:
                 on c.cust_no = h.cust_no and c.yyyymm = h.yyyymm;
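
If custom_hist can hold more than one row per (cust_no, yyyymm), the join above would duplicate rows in cust_tabl. Below is a minimal sketch of the row_number approach mentioned in the answer. It assumes that the row with the latest cust_dt should win; that rule is an assumption and not something stated in the question, so adjust the order by clause to whatever defines the "right" row in your data. It also does not handle duplicates inside cust_tabl itself.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table cust_tabl partition (yyyymm)
select
      c.cust_no,
      coalesce(h.cust_cd_cnt_1, c.cf_mng) as cf_mng, --take history column if joined
      coalesce(h.cust_cd_cnt_2, c.cf_sds) as cf_sds, --take original if not joined
      coalesce(h.cust_cd_cnt_3, c.cf_htg) as cf_htg,
      c.cust_loc,
      c.cust_region,
      c.cust_country,
      c.cust_reg_id,
      c.yyyymm     --partition column goes last
  from cust_tabl c
       left join (
            --keep exactly one history row per (cust_no, yyyymm)
            select s.cust_no, s.yyyymm,
                   s.cust_cd_cnt_1, s.cust_cd_cnt_2, s.cust_cd_cnt_3
              from (select hh.cust_no, hh.yyyymm,
                           hh.cust_cd_cnt_1, hh.cust_cd_cnt_2, hh.cust_cd_cnt_3,
                           row_number() over (partition by hh.cust_no, hh.yyyymm
                                              order by hh.cust_dt desc) as rn --assumption: latest cust_dt wins
                      from custom_hist hh) s
             where s.rn = 1
       ) h
                 on c.cust_no = h.cust_no and c.yyyymm = h.yyyymm;
```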

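Thank you for your answer, but re-inserting all the columns will take a lot of time, because the table is very large and has many columns. Can you help me update only those columns, by altering the table or by some other method?

Thanks. I do not know of any other efficient way than joining these tables. The join is the slow part and takes about 90% of the time. The rewrite is not very slow, so do not worry about it. If the table is too big, you can add distribute by yyyymm at the end of the query to reduce the pressure on the reducers, and try lowering the hive.exec.reducers.bytes.per.reducer parameter to increase reducer parallelism.

The client does not agree to re-inserting the data.

If your Hive version supports ACID update or merge, you can try that feature. It will not be faster, because you still need to join the datasets and rewrite the table partition files. Another possible approach is to restrict the partitions being rewritten to only those that are joined, and leave the partitions that are not joined unchanged. Use a subquery in place of table c: (select * from cust_tabl where yyyymm in (select distinct yyyymm from custom_hist)) c. This avoids rewriting partitions that are not updated, if there are any. However, if there are no such partitions, or only a few of them, the additional subquery may reduce performance.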
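
Putting the last two suggestions together, a sketch of the restricted rewrite could look like the following. The reducer size of 64 MB is purely an illustrative value, not a recommendation from the thread, and whether the in subquery actually prunes partitions at read time depends on your Hive version and execution engine. Because the insert uses dynamic partitioning, only the partitions that appear in the query result are overwritten.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--illustrative value only: a smaller size per reducer means more reducers
set hive.exec.reducers.bytes.per.reducer=67108864;

insert overwrite table cust_tabl partition (yyyymm)
select
      c.cust_no,
      coalesce(h.cust_cd_cnt_1, c.cf_mng) as cf_mng,
      coalesce(h.cust_cd_cnt_2, c.cf_sds) as cf_sds,
      coalesce(h.cust_cd_cnt_3, c.cf_htg) as cf_htg,
      c.cust_loc,
      c.cust_region,
      c.cust_country,
      c.cust_reg_id,
      c.yyyymm     --partition column goes last
  from (--read only the partitions that also exist in custom_hist
        select t.*
          from cust_tabl t
         where t.yyyymm in (select distinct hp.yyyymm from custom_hist hp)
       ) c
       left join custom_hist h
                 on c.cust_no = h.cust_no and c.yyyymm = h.yyyymm
distribute by c.yyyymm;
```

Since the question states that custom_hist only covers 201505 to 201509, a literal filter such as where t.yyyymm between 201505 and 201509 may be a simpler alternative to the in subquery, assuming that range is still accurate for your data.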