SQL Hive: counting the matches from an inner join on the unique values of a column

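I am trying to count the matches between columns that result from an inner join of two tables, where the join is based on the unique values of a single column in one of the two tables. An example should make this clearer:

If I have the following two tables: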

Table A
-------
id_A:   info_A
1       'a'
2       'b'
3       'c'
3       'd'

Table B
-------
id_B:   info_B
1       'a'
3       'c'
5       'b'
I want to find the unique id_A values [1,2,3] and the info_A values associated with them ['a','b','c','d'].

I want to create a table that looks like this:

Table join of A+B
-----------------
id_A:   info_A   id_B   info_B    match_cnt
1       'a'      1      'a'       1
3       'c','d'  3      'c'       0.5
where match_cnt is the number of matches between info_A and info_B for a given id_A. FYI, the actual tables I am working with have billions of rows.

The code block shows what I have tried, along with several variations (not shown below):




You can use something like the following:

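 WITH T1 AS (
   -- number of tableA rows per id whose (id, info) pair also appears in tableB
   SELECT tableA.ID_A, count(1) AS cnt
   FROM tableA INNER JOIN tableB
     ON tableA.ID_A = tableB.ID_B AND tableA.INFO_A = tableB.INFO_B
   GROUP BY tableA.ID_A
 )
 -- divide the match count by the number of distinct info_A values for that id
 SELECT tmp.ID_A, tmp.a, tmp.ID_B, tmp.b, (T1.cnt / size(tmp.a)) AS match_cnt
 FROM (
   SELECT tableA.ID_A, collect_set(tableA.INFO_A) AS a, tableB.ID_B, collect_set(tableB.INFO_B) AS b
   FROM tableA INNER JOIN tableB ON tableA.ID_A = tableB.ID_B
   GROUP BY tableA.ID_A, tableB.ID_B
 ) tmp
 JOIN T1 ON T1.ID_A = tmp.ID_A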

+----+-----------+--------+-----------+
| id |  info_a   | info_b | match_cnt |
+----+-----------+--------+-----------+
|  1 | ["a"]     | ["a"]  | 1.0       |
|  3 | ["c","d"] | ["c"]  | 0.5       |
+----+-----------+--------+-----------+
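
As a sanity check on the definition of match_cnt, the same calculation can be sketched in plain Python on the sample rows from the question. This is not Hive code and says nothing about performance on billions of rows; the function name `match_fractions` is made up for illustration, and the sketch assumes match_cnt means "distinct info_A values that also appear in info_B for the same id, divided by the number of distinct info_A values".

```python
from collections import defaultdict

# Sample rows from the question: (id, info)
table_a = [(1, "a"), (2, "b"), (3, "c"), (3, "d")]
table_b = [(1, "a"), (3, "c"), (5, "b")]


def match_fractions(rows_a, rows_b):
    """Hypothetical helper: for each id present in both tables, return
    (sorted info_A values, sorted info_B values, matched fraction)."""
    info_a = defaultdict(set)
    info_b = defaultdict(set)
    for id_, info in rows_a:
        info_a[id_].add(info)
    for id_, info in rows_b:
        info_b[id_].add(info)

    result = {}
    # Inner join on id: keep only ids that occur in both tables.
    for id_ in sorted(info_a.keys() & info_b.keys()):
        matched = info_a[id_] & info_b[id_]
        fraction = len(matched) / len(info_a[id_])
        result[id_] = (sorted(info_a[id_]), sorted(info_b[id_]), fraction)
    return result


result = match_fractions(table_a, table_b)
for id_, (a, b, fraction) in result.items():
    print(id_, a, b, fraction)
```

Because the sketch works on sets, duplicate (id, info) rows are counted once; a query that uses count(1) would count them separately, so the two only agree when such duplicates are absent.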