Hive: Getting the set difference between Hive tables with duplicate rows

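I have two Hive tables: table 1 and table 2. Table 1 has duplicate rows; table 2 does not. I want to get the rows of table 1 that are missing from table 2, including the duplicates. How can I do this in Hive Query Language (HiveQL)?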

For example:

Table 1 data:

Col1,Col2
A1,V1
A1,V1
A2,V2
A3,V3
A3,V3
A3,V3
A4,V4

Table 2 data:

Col1,Col2
A1,V1
A2,V2
A3,V3

I want to get the following missing data from table 1:

Col1,Col2
A1,V1
A3,V3
A3,V3
A4,V4

You can use the following:

-- Sample data from the question, inlined as CTEs
with t1 as (
  select 'A1' col1,'V1' col2 union all
  select 'A1' col1,'V1' col2 union all
  select 'A2' col1,'V2' col2 union all
  select 'A3' col1,'V3' col2 union all
  select 'A3' col1,'V3' col2 union all
  select 'A3' col1,'V3' col2 union all
  select 'A4' col1,'V4' col2
),
t2 as (
  select 'A1' col1,'V1' col2 union all
  select 'A2' col1,'V2' col2 union all
  select 'A3' col1,'V3' col2
),
-- Number the copies of each (col1, col2) pair in t1: 1, 2, 3, ...
t1_with_rn as (
  select t1.*, row_number() over(partition by t1.col1, t1.col2) rn from t1
)
select
  t1_with_rn.col1, t1_with_rn.col2
from
  t1_with_rn
  -- Only the first copy (rn = 1) can match t2; extra copies never match
  left join t2 on (t1_with_rn.col1 = t2.col1 and t1_with_rn.col2 = t2.col2 and t1_with_rn.rn = 1)
where
  -- Keep only the t1 rows that found no match in t2
  t2.col1 is null and t2.col2 is null
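
The query above relies on table 2 having no duplicates, as stated in the question. If table 2 could also contain duplicates, one possible generalization (a sketch only, not tested on a real cluster) is to number the copies in both tables and join on the copy number as well. The CTE names t1_rn and t2_rn are made up for this illustration:

```sql
-- Sketch: multiset difference that also tolerates duplicates in t2.
-- t1_rn and t2_rn are illustrative names, not part of the original answer.
with t1 as (
  select 'A1' col1,'V1' col2 union all
  select 'A1' col1,'V1' col2 union all
  select 'A2' col1,'V2' col2 union all
  select 'A3' col1,'V3' col2 union all
  select 'A3' col1,'V3' col2 union all
  select 'A3' col1,'V3' col2 union all
  select 'A4' col1,'V4' col2
),
t2 as (
  select 'A1' col1,'V1' col2 union all
  select 'A2' col1,'V2' col2 union all
  select 'A3' col1,'V3' col2
),
-- Number the copies of each (col1, col2) pair on both sides
t1_rn as (
  select col1, col2, row_number() over(partition by col1, col2) rn from t1
),
t2_rn as (
  select col1, col2, row_number() over(partition by col1, col2) rn from t2
)
select
  t1_rn.col1, t1_rn.col2
from
  t1_rn
  -- The n-th copy in t1 is cancelled only by the n-th copy in t2
  left join t2_rn on (t1_rn.col1 = t2_rn.col1 and t1_rn.col2 = t2_rn.col2 and t1_rn.rn = t2_rn.rn)
where
  -- rn is never null for a matched row, so null means "no match"
  t2_rn.rn is null;
```

With the sample data this should return the same four rows as in the question (A1,V1 once, A3,V3 twice, A4,V4 once), in no guaranteed order. Note that in both versions rows where col1 or col2 is NULL never match under `=`, so they would always be reported as missing.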

Thanks, this works. However, my tables have millions of rows, so let's see how it performs.
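
If performance turns out to be a concern, it may be worth benchmarking against the built-in set operator. Newer Hive releases support EXCEPT ALL, which keeps duplicates; whether your version supports it is an assumption you would need to check in its language manual. A minimal sketch using the same sample data:

```sql
-- Sketch: assumes a Hive version that supports EXCEPT ALL.
with t1 as (
  select 'A1' col1,'V1' col2 union all
  select 'A1' col1,'V1' col2 union all
  select 'A2' col1,'V2' col2 union all
  select 'A3' col1,'V3' col2 union all
  select 'A3' col1,'V3' col2 union all
  select 'A3' col1,'V3' col2 union all
  select 'A4' col1,'V4' col2
),
t2 as (
  select 'A1' col1,'V1' col2 union all
  select 'A2' col1,'V2' col2 union all
  select 'A3' col1,'V3' col2
)
-- Each row of t2 cancels exactly one identical row of t1
select col1, col2 from t1
except all
select col1, col2 from t2;
```

I have not measured which variant is faster. Both need a shuffle on (col1, col2), so the difference, if any, would depend on your data and Hive version.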