PostgreSQL aggregate union, intersection and set differences
I have a table of pairs that I want to aggregate, as follows:
+---------+----------+
| left_id | right_id |
+---------+----------+
| a | b |
+---------+----------+
| a | c |
+---------+----------+
And a table of values, as follows:
+----+-------+
| id | value |
+----+-------+
| a | 1 |
+----+-------+
| a | 2 |
+----+-------+
| a | 3 |
+----+-------+
| b | 1 |
+----+-------+
| b | 4 |
+----+-------+
| b | 5 |
+----+-------+
| c | 1 |
+----+-------+
| c | 2 |
+----+-------+
| c | 3 |
+----+-------+
| c | 4 |
+----+-------+
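To try the queries in this thread locally, the two sample tables can be created roughly as sketched below. The column types (text and integer) are assumptions, since the question only shows the data. Note that the table name values is a reserved word in PostgreSQL, so it has to be double-quoted.

```sql
-- Sketch of the sample data; column types are assumed, not taken from the question.
create table pairs (
    left_id  text not null,
    right_id text not null
);

-- "values" is a reserved word in PostgreSQL and must be quoted as an identifier.
create table "values" (
    id      text    not null,
    "value" integer not null
);

insert into pairs (left_id, right_id) values
    ('a', 'b'),
    ('a', 'c');

insert into "values" (id, "value") values
    ('a', 1), ('a', 2), ('a', 3),
    ('b', 1), ('b', 4), ('b', 5),
    ('c', 1), ('c', 2), ('c', 3), ('c', 4);
```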
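For each pair, I want to compute the size of the union, the intersection and the set difference (in each direction) of the two ids' values, so that the output looks like this: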
+---------+----------+-------+--------------+-----------+------------+
| left_id | right_id | union | intersection | left_diff | right_diff |
+---------+----------+-------+--------------+-----------+------------+
| a | b | 5 | 1 | 2 | 2 |
+---------+----------+-------+--------------+-----------+------------+
| a | c | 4 | 3 | 0 | 1 |
+---------+----------+-------+--------------+-----------+------------+
What is the best way to achieve this using PostgreSQL?
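Update: here is a rextester link with the data.

You need scalar subqueries for this. The union can also be expressed with OR (written as IN in the query below), which makes that part shorter to write. For the intersection you need a slightly longer query. To compute the "difference", use the except operator: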
select p.*,
       -- union: distinct values that belong to either id of the pair
       (select count(distinct v."value")
        from "values" v
        where v.id in (p.left_id, p.right_id)) as "union",
       -- intersection: values present for both ids
       (select count(*)
        from (
            select v."value" from "values" v where v.id = p.left_id
            intersect
            select v."value" from "values" v where v.id = p.right_id
        ) t) as intersection,
       -- left_diff: values of left_id that right_id does not have
       (select count(*)
        from (
            select v."value" from "values" v where v.id = p.left_id
            except
            select v."value" from "values" v where v.id = p.right_id
        ) t) as left_diff,
       -- right_diff: values of right_id that left_id does not have
       (select count(*)
        from (
            select v."value" from "values" v where v.id = p.right_id
            except
            select v."value" from "values" v where v.id = p.left_id
        ) t) as right_diff
from pairs p;
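The remark that the union can be written with OR instead of a UNION set operation can be checked on the sample pair (a, b). The snippet below is only a sketch: it uses an inline CTE named vals (a hypothetical name) holding the rows for a and b, so it runs without any tables.

```sql
-- vals is a hypothetical inline copy of the sample rows for ids a and b.
with vals(id, "value") as (
    values ('a', 1), ('a', 2), ('a', 3),
           ('b', 1), ('b', 4), ('b', 5)
)
select
    -- explicit set operation: UNION removes duplicates before counting
    (select count(*)
     from (
         select "value" from vals where id = 'a'
         union
         select "value" from vals where id = 'b'
     ) t) as union_via_set_op,
    -- same result with a plain OR filter and count(distinct ...)
    (select count(distinct "value")
     from vals
     where id = 'a' or id = 'b') as union_via_or;
-- both columns return 5 for this data
```

The OR form scans the values once per pair, while the set-operation form reads them twice, which is why it is the shorter of the two to write.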
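I don't know what causes your slowness, as I cannot see the table sizes and/or the explain plans. Assuming both tables are large enough to make nested loops inefficient, and not daring to think about joining values to itself, I would try to rewrite it without scalar subqueries, as follows: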
select p.*,
       coalesce(stats."union", 0) "union",
       coalesce(stats.intersection, 0) intersection,
       coalesce(stats.left_cnt - stats.intersection, 0) left_diff,
       coalesce(stats.right_cnt - stats.intersection, 0) right_diff
from pairs p
left join (
    -- one row per (pair, value); count(expr) skips the NULL flag of an unmatched side
    -- note: the counts assume each (id, value) row occurs only once in "values"
    select left_id,
           right_id,
           count(*) "union",
           count(has_left and has_right) intersection,
           count(has_left) left_cnt,
           count(has_right) right_cnt
    from (
        -- values belonging to the left id of each pair
        select p.*,
               v."value" the_value,
               true has_left
        from pairs p
        join "values" v on v.id = p.left_id
    ) l
    full join (
        -- values belonging to the right id of each pair
        select p.*,
               v."value" the_value,
               true has_right
        from pairs p
        join "values" v on v.id = p.right_id
    ) r using (left_id, right_id, the_value)
    group by left_id,
             right_id
) stats on p.left_id = stats.left_id
       and p.right_id = stats.right_id;
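Every join condition here allows a hash and/or merge join, so the planner has a chance to avoid nested loops.

I don't understand what the "pairwise union" is supposed to be. Can you explain the logic behind a,b giving 5?

It is the number of unique values when all values of a (1,2,3) and b (1,4,5) are combined. Union = (1,2,3,4,5) = 5 values. Intersection = (1) = 1 value, left difference = (2,3) = 2 values, right difference = (4,5) = 2 values.

Thank you for introducing me to "scalar subqueries"; I did not know there was such a thing. Another user posted a very similar solution a minute earlier, but pointed out that it might be very inefficient. Unfortunately it has been deleted now (I really wish that did not happen while I was writing a reply). Is there a way to make such a query more efficient? I guess I could save the result table as a materialized view, maybe?

Very cool. Thanks @alexey bashtanov and @a_horse_with_no_name for your answers; I learned a lot from both. In the end, this solution ran slightly faster than the "scalar subquery" one, finishing in 1m17s instead of 1m36s. Showing the query plan for my data set is a bit messy, because the pairs table is a view that aggregates data from multiple tables (376 rows), so the plan is large. The values table has 2.9 million rows, by the way. Too bad my question got downvoted.

If pairs is something small that is aggregated from somewhere, it may make sense to put it in a CTE, since it can then be computed only once. For plan optimization purposes, just create a temporary table with the view's contents and analyze it, then use it instead of the view.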
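The temporary-table suggestion can be sketched as follows. This is only an illustration under assumptions: pairs is the view from the question, "values" is the large table, and pairs_tmp is a hypothetical name that does not appear in the thread.

```sql
-- Hypothetical name pairs_tmp; assumes the pairs view and the "values" table exist.
-- Copy the view's contents once so the planner works with a plain table.
create temporary table pairs_tmp as
select left_id, right_id
from pairs;

-- Collect statistics so the row estimates for pairs_tmp are realistic.
analyze pairs_tmp;

-- Inspect the plan of one of the joins, now reading from the temp table.
-- Note that explain (analyze) actually executes the query.
explain (analyze, buffers)
select p.left_id, p.right_id, v."value"
from pairs_tmp p
join "values" v on v.id = p.left_id;
```

With the view replaced by pairs_tmp in either full query, the resulting plan should be much shorter to read, which makes it easier to see whether hash or merge joins are chosen.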