SQL / Hive: counting matches from an inner join for the unique values of a column
I am trying to count the matches between columns that result from an inner join of two tables, where the join is based on the unique values of a single column in one of the tables. An example should make this clearer. Suppose I have the following two tables:
Table A
-------
id_A: info_A
1 'a'
2 'b'
3 'c'
3 'd'
Table B
-------
id_B: info_B
1 'a'
3 'c'
5 'b'
I want to find the unique id_A values: [1,2,3] and the info_A values associated with them: ['a','b','c','d'].
I want to create a table that looks like this:
Table join of A+B
-----------------
id_A: info_A id_B info_B match_cnt
1 'a' 1 'a' 1
3 'c','d' 3 'c' 0.5
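where match_cnt is the number of matches between info_A and info_B for a given id_A (in the example it is expressed as a fraction of the distinct info_A values for that id, so id 3 gets 1/2 = 0.5). FYI, the actual tables I am working with have billions of rows.
The code block shows what I have tried, along with several variations (not shown below):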
You can use the following:
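-- T1: per id, count the rows that match on both id and info.
-- Grouping by ID_A only, so each id yields a single count.
WITH T1 AS (
  SELECT tableA.ID_A, count(1) AS cnt
  FROM tableA
  INNER JOIN tableB
    ON tableA.ID_A = tableB.ID_B AND tableA.INFO_A = tableB.INFO_B
  GROUP BY tableA.ID_A
)
-- tmp: per id, collect the distinct info values from each table.
-- The join and GROUP BY use the id columns (the original referenced a nonexistent column "a").
SELECT tmp.ID_A, tmp.a, tmp.ID_B, tmp.b, (T1.cnt / size(tmp.a)) AS match_cnt
FROM (
  SELECT tableA.ID_A, collect_set(tableA.INFO_A) AS a,
         tableB.ID_B, collect_set(tableB.INFO_B) AS b
  FROM tableA
  INNER JOIN tableB ON tableA.ID_A = tableB.ID_B
  GROUP BY tableA.ID_A, tableB.ID_B
) tmp
JOIN T1 ON T1.ID_A = tmp.ID_A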
+----+-----------+--------+-----------+
| id | info_a | info_b | match_cnt |
+----+-----------+--------+-----------+
| 1 | ["a"] | ["a"] | 1.0 |
| 3 | ["c","d"] | ["c"] | 0.5 |
+----+-----------+--------+-----------+
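The match_cnt ratio in the table above can be cross-checked on the two example tables with a small plain-Python sketch. This is not Hive code and is not meant for billions of rows; the helper name match_ratio and the in-memory lists are assumptions made purely for illustration. It also uses set semantics (distinct info values), which agrees with count(1) only when the tables contain no duplicate rows.

```python
from collections import defaultdict

# Example rows from the question, as (id, info) pairs
table_a = [(1, "a"), (2, "b"), (3, "c"), (3, "d")]
table_b = [(1, "a"), (3, "c"), (5, "b")]


def match_ratio(rows_a, rows_b):
    """Hypothetical helper: for each id present in both tables, return
    (distinct info_A values, distinct info_B values,
     number of matching values / number of distinct info_A values)."""
    info_a = defaultdict(set)
    info_b = defaultdict(set)
    for id_, info in rows_a:
        info_a[id_].add(info)
    for id_, info in rows_b:
        info_b[id_].add(info)

    result = {}
    # Intersecting the key sets plays the role of the inner join on id
    for id_ in sorted(info_a.keys() & info_b.keys()):
        matches = info_a[id_] & info_b[id_]
        if matches:  # keep only ids with at least one matching info value
            result[id_] = (
                sorted(info_a[id_]),
                sorted(info_b[id_]),
                len(matches) / len(info_a[id_]),
            )
    return result


for id_, (a, b, ratio) in match_ratio(table_a, table_b).items():
    print(id_, a, b, ratio)
```

For the example data this prints one line for id 1 with ratio 1.0 and one line for id 3 with ratio 0.5; id 2 and id 5 drop out because they appear in only one table.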