Hive SQL: find unique values if a condition is met between two tables

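I have two tables. table1 lists all the unique places I am interested in (30 rows):

places
Japan
China
India
...

Use LEFT SEMI JOIN (it implements an uncorrelated EXISTS efficiently): it keeps only the table2 records that have a match in table1, and DISTINCT is then applied to the result: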

create temporary table imp.unique_ids_tmp as
select distinct t2.id --distinct is not a function, do not need ()
  from table2 t2
       left semi join table1 t1 on t2.places = t1.places
 where t2.date = '20210204'
;

The "at least once" condition is satisfied: IDs that have no joined records will not appear in the result.

Another approach is to use a correlated EXISTS:

create temporary table imp.unique_ids_tmp as
select distinct t2.id --distinct is not a function, do not need ()
  from table2 t2
 where t2.date = '20210204' 
   --this condition is true as soon as one match is found
   and exists (select 1 from table1 t1 where t2.places = t1.places)
;

IN will also work.
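As a rough sketch, the same filter can be written with an IN subquery. It reuses the table and column names from the queries above and assumes a Hive version that supports subqueries in the WHERE clause:

```sql
-- IN subquery variant: keep table2 rows whose place appears in table1
select distinct t2.id
  from table2 t2
 where t2.date = '20210204'
   and t2.places in (select t1.places from table1 t1)
;
```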

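Correlated EXISTS looks close to "once a satisfying row is found for an id, stop looking at that id's records", but in Hive all of these methods are implemented using a JOIN. Run EXPLAIN and you will see that the generated plan is the same, although this depends on the implementation in your version. EXISTS can potentially be faster because it does not need to check all the records in the subquery. Since table1, with its 30 rows, is small enough to fit in memory, a MAP-JOIN (set hive.auto.convert.join=true;) will give you the best performance.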
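One way to check this on your own cluster is to enable map-join conversion and compare the plans of the two variants. This is only a sketch; the plan output differs between Hive versions and execution engines, so no expected output is shown:

```sql
-- Let Hive convert joins against a small table into map joins
set hive.auto.convert.join=true;

-- Plan for the LEFT SEMI JOIN variant
explain
select distinct t2.id
  from table2 t2
       left semi join table1 t1 on t2.places = t1.places
 where t2.date = '20210204'
;

-- Plan for the correlated EXISTS variant
explain
select distinct t2.id
  from table2 t2
 where t2.date = '20210204'
   and exists (select 1 from table1 t1 where t2.places = t1.places)
;
```

If both plans show a map join operator reading table1 as the small table, the two queries should perform about the same.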

One more fast method is to use an array or IN (static_list). It is suitable for small, static lists. An ordered array may give you better performance:

select distinct t2.id --distinct is not a function, do not need ()
  from table2 t2
 where t2.date = '20210204'
       and array_contains(array('australia', 'china', 'japan', ... ), t2.places)
       --OR use t2.places IN ('australia', 'china', 'japan', ... )

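Why this method is faster: only table2 is read, so no mappers have to be started and no splits calculated to read table1 from HDFS. The drawback is that the list of values is static. On the other hand, the whole list can be passed in as a parameter.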
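A minimal sketch of passing the list as a parameter, assuming Hive variable substitution is enabled (it is by default). The variable name places_list is hypothetical and the three values are placeholders, not the full list of 30 places:

```sql
-- Define the list once; the text after "=" is substituted verbatim into the query
set hivevar:places_list='australia','china','japan';

select distinct t2.id
  from table2 t2
 where t2.date = '20210204'
   and t2.places in (${hivevar:places_list})
;
```

The same variable can be supplied from the command line with the --hivevar option instead of the set statement, which keeps the script itself free of hard-coded values.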