HiveQL: How to find duplicate elements in a column of array<string>
I am creating a table in HiveQL in which one column, duplicate_set, should be an array containing the set of duplicated elements from the list stored in another column, list. For example, given a table:
+-----------+-------------------------+----------------------+
| id | list | duplicate_set |
+-----------+-------------------------+----------------------+
| 1 | ["1","2","2","3","3"] | ["2","3"] |
+-----------+-------------------------+----------------------+
| 2 | ["2","2","5","6"] | ["2"] |
+-----------+-------------------------+----------------------+
| 3 | ["2","4","5","6"] | [] |
...
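What is the best way to extract the duplicated elements and put them into a set? Is there an existing UDF for this? Thanks.
You can explode the array, compute row_number for each element within its id, and then aggregate the duplicated elements (those with row_number > 1) into a set: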
with initial_data as (
  select 1 id, array("1","2","2","3","3") list union all
  select 2, array("2","2","5","6") list union all
  select 3, array("2","4","5","6")
)
select s.id, s.list,
       collect_set(case when s.rn>1 then x end) duplicate_set --NULLs are skipped, so no duplicates gives []
from (
  select s.id, s.list, l.x,
         row_number() over(partition by id, l.x) as rn --rn>1 marks a repeated occurrence of x
  from initial_data s
  lateral view explode(list) l as x --array element x
) s
group by s.id, s.list;
Result:
id list duplicate_set
1 ["1","2","2","3","3"] ["2","3"]
2 ["2","2","5","6"] ["2"]
3 ["2","4","5","6"] []
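A variant of the same explode-and-aggregate idea, offered only as a sketch: instead of numbering occurrences with row_number, it counts them with count(*) over the same partition and keeps the elements whose count exceeds 1. The alias d and the column name cnt are introduced here for illustration and are not from the original answer; initial_data is the same sample data as above.

```sql
with initial_data as (
  select 1 id, array("1","2","2","3","3") list union all
  select 2 id, array("2","2","5","6") list union all
  select 3 id, array("2","4","5","6") list
)
select s.id, s.list,
       collect_set(case when s.cnt > 1 then s.x end) duplicate_set --collect_set skips NULLs
from (
  select d.id, d.list, l.x,
         count(*) over(partition by d.id, l.x) as cnt --occurrences of x within one id
  from initial_data d
  lateral view explode(d.list) l as x --one row per array element
) s
group by s.id, s.list;
```

Under these assumptions it should yield the same duplicate sets as the row_number version. collect_set does not guarantee element order, so the order inside duplicate_set may differ between runs.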
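Is there an array column in the original table?
@VamsiPrabhala, I am not sure I understand you correctly; the list column is the array column.
Do you also want an empty list when no elements are duplicated?
This is an elegant solution.
This is awesome! Thanks @leftjoin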