PostgreSQL: aggregate union, intersection, and set difference

sql postgresql

I have a table of pairs that I want to aggregate over, like this:

+---------+----------+
| left_id | right_id |
+---------+----------+
| a       | b        |
+---------+----------+
| a       | c        |
+---------+----------+

and a table of values, like this:

+----+-------+
| id | value |
+----+-------+
| a  | 1     |
+----+-------+
| a  | 2     |
+----+-------+
| a  | 3     |
+----+-------+
| b  | 1     |
+----+-------+
| b  | 4     |
+----+-------+
| b  | 5     |
+----+-------+
| c  | 1     |
+----+-------+
| c  | 2     |
+----+-------+
| c  | 3     |
+----+-------+
| c  | 4     |
+----+-------+

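For each pair, I want to compare the two ids' values and compute the size of their union, their intersection, and their set difference in each direction, so that the output looks like this: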

+---------+----------+-------+--------------+-----------+------------+
| left_id | right_id | union | intersection | left_diff | right_diff |
+---------+----------+-------+--------------+-----------+------------+
| a       | b        | 5     | 1            | 2         | 2          |
+---------+----------+-------+--------------+-----------+------------+
| a       | c        | 4     | 3            | 0         | 1          |
+---------+----------+-------+--------------+-----------+------------+

What is the best way to do this in PostgreSQL?

Update: here is a rextester link containing the data
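
Since the linked data is not reproduced here, the sample tables shown above could be recreated roughly as follows. This is only a sketch: the column types (text ids, integer values) are assumptions and are not stated in the question.

```sql
-- Assumed schema: column types are guesses based on the sample data.
-- "values" and "value" are quoted defensively because VALUES is an SQL key word.
create table pairs (left_id text, right_id text);
create table "values" (id text, "value" integer);

-- The two pairs from the question.
insert into pairs (left_id, right_id)
values ('a', 'b'),
       ('a', 'c');

-- The ten rows of the values table from the question.
insert into "values" (id, "value")
values ('a', 1), ('a', 2), ('a', 3),
       ('b', 1), ('b', 4), ('b', 5),
       ('c', 1), ('c', 2), ('c', 3), ('c', 4);
```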

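You need scalar subqueries for this.

The union can also be expressed with an OR condition (written below as id in (p.left_id, p.right_id)), which makes that part of the query shorter to write. For the intersection, however, you need a somewhat longer query. To compute the "difference", use the except operator: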
SELECT p.*, 
       (select count(distinct value) from values where id in (p.left_id, p.right_id)) as "union",
       (select count(*)
        from (
          select v.value from values v where id = p.left_id
          intersect
          select v.value from values v where id = p.right_id
        ) t) as intersection,
       (select count(*)
        from (
          select v.value from values v where id = p.left_id
          except
          select v.value from values v where id = p.right_id
        ) t) as left_diff,
       (select count(*)
        from (
          select v.value from values v where id = p.right_id
          except
          select v.value from values v where id = p.left_id
        ) t) as right_diff
from pairs p

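I don't know what is making your query slow, because I can't see the table sizes and/or the explain plan. Assuming both tables are large enough that nested loops are inefficient, and not daring to consider joining values to itself, I would try rewriting it without scalar subqueries, as follows: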
select p.*,
       coalesce(stats."union", 0) "union",
       coalesce(stats.intersection, 0) intersection,
       coalesce(stats.left_cnt - stats.intersection, 0) left_diff,
       coalesce(stats.right_cnt - stats.intersection, 0) right_diff
from pairs p
left join (
       select left_id,
              right_id,
              count(*) "union",
              count(has_left and has_right) intersection,
              count(has_left) left_cnt,
              count(has_right) right_cnt
       from (
              select p.*,
                     v."value" the_value,
                     true has_left
              from pairs p
              join "values" v on v.id = p.left_id
       ) l
       full join (
              select p.*,
                     v."value" the_value,
                     true has_right
              from pairs p
              join "values" v on v.id = p.right_id
       ) r using(left_id, right_id, the_value)
       group by left_id,
                right_id
) stats on p.left_id = stats.left_id
       and p.right_id = stats.right_id;

Every join condition here allows a hash and/or merge join, so the planner has a chance to avoid nested loops.
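
One way to check which join strategy the planner actually chooses is to prefix a query with explain. The sketch below applies it to a simplified version of the left-hand join only. It assumes the pairs and "values" tables from the question exist, and the plan you get will depend on your data and statistics.

```sql
-- Runs the statement and reports the chosen plan nodes
-- (e.g. Hash Join, Merge Join, Nested Loop) with timings and buffer usage.
explain (analyze, buffers)
select p.left_id,
       p.right_id,
       v."value" the_value
from pairs p
join "values" v on v.id = p.left_id;
```

The same prefix can be put in front of the full query above. Note that the analyze option really executes the statement, so on a large table it takes as long as the query itself.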
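I don't understand what a "union for a pair" is supposed to be. Can you explain the logic behind the 5 for a, b?

It is the number of unique values you get if you combine all the values in a (1, 2, 3) and b (1, 4, 5). Union = (1, 2, 3, 4, 5) = 5 values. Intersection is (1) = 1 value, left diff is (2, 3) = 2 values, and right diff is (4, 5) = 2 values.

Thanks for introducing me to "scalar subqueries". I didn't know there was such a thing. Another user posted a very similar solution a minute ago, but pointed out that it could be very inefficient. Unfortunately it has now been deleted (I really hope that didn't happen while I was writing my reply). Is there a way to make a query like this more efficient? I guess I could save the result table as a materialized view, maybe?

Very cool. Thank you @alexey bashtanov and @a_horse_with_no_name for your answers. I learned a lot from both. In the end, this solution ran slightly faster than the "scalar subquery" one, finishing in 1m17s instead of 1m36s. Showing the query plan for my data set is a bit messy, because the pairs table is a view that aggregates data from several tables (376 rows), so the plan is large. The values table has 2.9 million rows, by the way. Too bad my question was downvoted.

If pairs is something small that is aggregated from somewhere else, it may make sense to put it in a CTE, so that it is computed only once. For plan-optimization purposes, just create a temp table with the contents of the view, analyze it, and then use it instead of the view.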
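
The temp-table suggestion in the last comment could look roughly like this. It is a sketch only: pairs_tmp is a hypothetical name, and pairs is assumed to be the view described above.

```sql
-- Copy the contents of the view into a session-local table.
-- pairs_tmp is a made-up name used for illustration.
create temporary table pairs_tmp as
select left_id, right_id
from pairs;

-- Collect statistics so the planner knows the actual row count.
analyze pairs_tmp;
```

The query would then read from pairs_tmp wherever it currently reads from pairs. The temporary table disappears at the end of the session.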