How to find all linked rows by array values in Postgres?

sql postgresql graph-theory recursive-query postgresql-12

I have a table like this:

id | arr_val  | grp
-----------------
1  | {10,20}  | -
2  | {20,30}  | -
3  | {50,5}   | -
4  | {30,60}  | -
5  | {1,5}    | -
6  | {7,6}    | -

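I want to find out which rows belong together. In this example, rows 1, 2 and 4 form one group, because 1 and 2 have a common element and 2 and 4 have a common element. Rows 3 and 5 form a group because they share an element. Row 6 has nothing in common with any other row, so it forms a group of its own. The result should look like this: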

id | arr_val  | grp
-----------------
1  | {10,20}  | 1
2  | {20,30}  | 1
3  | {50,5}   | 2
4  | {30,60}  | 1
5  | {1,5}    | 2
6  | {7,6}    | 3

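I think I need a recursive CTE, because my problem is graph-like, but I don't know how to implement it.

Additional information and background:

The table has about 2,500,000 rows.

In fact, the problem I am trying to solve has more fields and more conditions for finding a group: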

id | arr_val  | date | val | grp
---------------------------------
1  | {10,20}  | -
2  | {20,30}  | -

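The elements of a group need to be linked not only by common elements in arr_val. They also all need to have the same value in val, and they need to be linked by a time span in date (a gaps-and-islands problem). I have solved those other two conditions, but now the condition from this question has been added. If there is an easy way to handle all three in one query, that would be great, but it is not required.

--- Edit ---

While both answers work for the small example above, they do not work for a table with many more rows. Both answers have the problem that the number of rows in the recursive part explodes and is only reduced at the very end. A solution should also work for data like this: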

id | arr_val  | grp
-----------------
1  | {1}      | -
2  | {1}      | -
3  | {1}      | -
4  | {1}      | -
5  | {1}      | -
6  | {1}      | -
7  | {1}      | -
8  | {1}      | -
9  | {1}      | -
10 | {1}      | -
11 | {1}      | -
more rows........

Is there a way to solve this problem?
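For cross-checking the output of any of the queries below, the expected grouping can also be computed outside the database. The following is only a minimal Python sketch, not something taken from the thread: it uses a union-find structure, and the function name `group_rows` and the dictionary layout are assumptions made for illustration. It never enumerates paths, so the `{1}`-only data from the edit is not a problem for it.

```python
def group_rows(rows):
    """rows maps id -> list of array values; returns id -> group number.

    Hypothetical helper: groups are numbered by their smallest id,
    which matches the grp column in the expected result above.
    """
    parent = {rid: rid for rid in rows}

    def find(x):
        # follow parent links to the root, shortening the chain on the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # keep the smaller id as root, so a root is its group's minimum id
            if ra < rb:
                parent[rb] = ra
            else:
                parent[ra] = rb

    first_owner = {}  # array value -> first row id seen with that value
    for rid, values in rows.items():
        for v in values:
            if v in first_owner:
                union(rid, first_owner[v])
            else:
                first_owner[v] = rid

    roots = sorted({find(rid) for rid in rows})
    rank = {root: i + 1 for i, root in enumerate(roots)}
    return {rid: rank[find(rid)] for rid in rows}


example = {1: [10, 20], 2: [20, 30], 3: [50, 5], 4: [30, 60], 5: [1, 5], 6: [7, 6]}
groups = group_rows(example)

# the data from the edit: many rows that all contain the value 1
same_value = {i: [1] for i in range(1, 12)}
same_value_groups = group_rows(same_value)
```
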

Here is one way to solve this graph-walking problem:

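with recursive cte as (
    -- anchor: every row starts a path that contains only itself
    select id, arr_val, array[id] path from mytable
    union all
    -- step: extend each path with an unvisited row whose array overlaps the current row's array
    select t.id, t.arr_val, c.path || t.id
    from cte c
    inner join mytable t on t.arr_val && c.arr_val and not t.id = any(c.path)
)
-- the smallest id on any path ending at a row identifies that row's group
select c.id, c.arr_val, dense_rank() over(order by min(x.id)) grp
from cte c
cross join lateral unnest(c.path) as x(id)
group by c.id, c.arr_val
order by c.id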

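The common table expression walks the graph, recursively looking for nodes adjacent to the current node while keeping track of the nodes already visited. The outer query then aggregates, identifies each group by the smallest node id found on the paths that end at a row, and finally ranks the groups.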
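To make the walk concrete, here is a minimal Python sketch that mimics the query step by step on the example data. It is an illustration only, not part of the answer; `walk_paths` and `rank_groups` are hypothetical names, and the in-memory dictionary stands in for `mytable`.

```python
def walk_paths(rows):
    """Mimic the recursive CTE: each produced row is (id, path)."""
    sets = {rid: set(vals) for rid, vals in rows.items()}
    frontier = [(rid, (rid,)) for rid in rows]  # anchor member
    all_rows = list(frontier)
    while frontier:
        nxt = []
        for cur, path in frontier:
            for other in rows:
                # t.arr_val && c.arr_val and not t.id = any(c.path)
                if other not in path and sets[other] & sets[cur]:
                    nxt.append((other, path + (other,)))
        all_rows.extend(nxt)
        frontier = nxt
    return all_rows


def rank_groups(all_rows):
    """Mimic the outer query: min id over all paths per row, then dense_rank."""
    smallest = {}
    for rid, path in all_rows:
        smallest[rid] = min(smallest.get(rid, rid), min(path))
    order = sorted(set(smallest.values()))
    return {rid: order.index(m) + 1 for rid, m in smallest.items()}


example = {1: [10, 20], 2: [20, 30], 3: [50, 5], 4: [30, 60], 5: [1, 5], 6: [7, 6]}
cte_rows = walk_paths(example)
grp = rank_groups(cte_rows)
```

On this data the walk produces 14 rows for 6 input rows. Every simple path is kept as its own row until the final aggregation, which is why the row count grows quickly once groups get larger.
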

You can handle this with a recursive CTE. Define the edges between ids based on common values. Then traverse the edges and aggregate:

with recursive nodes as (
      -- one row per (id, array element)
      select id, val
      from t cross join
           unnest(arr_val) as val
     ),
     edges as (
      -- two ids are connected when they share a value; this includes (id, id)
      select distinct n1.id as id1, n2.id as id2
      from nodes n1 join
           nodes n2
           on n1.val = n2.val
     ),
     cte as (
      -- anchor: start one walk at every id
      select id1, id2, array[id1] as visited, 1 as lev
      from edges
      where id1 = id2
      union all
      -- step: follow an edge to an id that this walk has not visited yet
      select cte.id1, e.id2, visited || e.id2,
             lev + 1
      from cte join
           edges e
           on cte.id2 = e.id1
      where e.id2 <> all(cte.visited)
     ),
     vals as (
      -- all ids reachable from each starting id
      select id1, array_agg(distinct id2 order by id2) as id2s
      from cte
      group by id1
    )
select *, dense_rank() over (order by id2s) as grp
from vals;

Here is a db<>fiddle.

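While Gordon Linoff's solution is the fastest one I found for small amounts of data where the groups are not too large, it does not work for larger data sets and larger groups. I changed his solution to make it work. I moved the edges to an indexed table:

create table edges
(
    id1 integer not null,
    id2 integer not null,
    constraint staffel_group_nodes_pk primary key (id1, id2)
);

That alone did not help. I also changed his recursive part: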

with recursive
    cte as (
        -- anchor: one search per starting id, visited = {id}
        select id1, array [id1] as visited
        from edges
        where id1 = id2
        union all
        -- step: add every unvisited neighbour of the visited set,
        -- then collapse back to a single row per starting id
        select unnested.id1, array_agg(distinct unnested.vis) as visited
        from (
                 select cte.id1,
                        unnest(cte.visited || e.id2) as vis
                 from cte
                          join
                      edges e
                      on e.id1 = any (cte.visited)
                          and e.id2 <> all (cte.visited)) as unnested
        group by unnested.id1
    ),
    vals as (
        -- union of everything each starting id reached in any step
        select id1, array_agg(distinct vis) as id2s
        from (
                 select cte.id1,
                        unnest(cte.visited) as vis
                 from cte) as unnested
        group by unnested.id1
    )
select id1,id2s, dense_rank() over (order by id2s) as grp
from vals;

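At every step I group all searches by their starting point. This greatly reduces the number of paths that are walked in parallel, and it runs surprisingly fast.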
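The effect of grouping by starting point can be sketched in Python. This is only an illustration under assumptions: `edges_from_rows` and `expand_per_start` are hypothetical names, and the in-memory set of pairs stands in for the edge table. Each step keeps at most one row per starting id, and that row holds the whole visited set.

```python
def edges_from_rows(rows):
    """All (id1, id2) pairs whose arrays overlap, including (id, id)."""
    sets = {rid: set(vals) for rid, vals in rows.items()}
    return {(a, b) for a in sets for b in sets if sets[a] & sets[b]}


def expand_per_start(edges):
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)

    # anchor: one row per starting id, visited = {start}
    current = {a: {a} for a, b in edges if a == b}
    reached = {start: set(seen) for start, seen in current.items()}
    rows_per_step = [len(current)]

    while current:
        nxt = {}
        for start, seen in current.items():
            new = set()
            for node in seen:
                new |= neighbours.get(node, set())
            new -= seen
            if new:  # a start that finds nothing new produces no row and drops out
                nxt[start] = seen | new
                reached[start] |= new
        if nxt:
            rows_per_step.append(len(nxt))
        current = nxt
    return reached, rows_per_step


example = {1: [10, 20], 2: [20, 30], 3: [50, 5], 4: [30, 60], 5: [1, 5], 6: [7, 6]}
reached, rows_per_step = expand_per_start(edges_from_rows(example))

# dense_rank() over (order by id2s): rank the sorted member lists
members = sorted({tuple(sorted(m)) for m in reached.values()})
grp = {start: members.index(tuple(sorted(m))) + 1 for start, m in reached.items()}

# 20 rows that all contain the value 1, as in the discussion of scaling
clique = {i: [1] for i in range(1, 21)}
_, clique_steps = expand_per_start(edges_from_rows(clique))
```

For the 20 fully connected rows, `clique_steps` is `[20, 20]`: 20 rows in the anchor and 20 rows in the single recursive step, because every search reaches all members at once.
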

Why is lev < 8 set?

@goodsnek. I removed it. That is something I use for debugging and sometimes forget to remove.

@goodsnek. Is there any reason why you did not accept this answer?

Yes. While this solution works for the example, it does not scale. If my data contains just one group of 20 fully connected members, e.g. 20 entries with an arr_val of {1}, your solution will walk all possible paths, which is on the order of 20 factorial, and only aggregates them at the end. A working solution needs a way to reduce the number of walked paths or parallel searches.

By the way, I could optimize your approach a little. You do not have to unnest for the nodes: nodes (id, val) as (select id, arr_val from test_cat), edges as (select n1.id as id1, n2.id as id2 from nodes n1 join nodes n2 on n1.val && n2.val),
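The size of that explosion can be estimated with a short calculation. The sketch below is an illustration, not part of the thread, and `path_rows_in_clique` is a hypothetical name. It assumes that the walk produces one row per simple path, that is, per ordered selection of 1 to n distinct members of a fully connected group.

```python
import math


def path_rows_in_clique(n):
    """Rows produced when every simple path in a fully connected group of n rows is walked."""
    # a path of length k is an ordered choice of k distinct members
    return sum(math.perm(n, k) for k in range(1, n + 1))


rows_for_3 = path_rows_in_clique(3)    # 3 + 6 + 6 paths
rows_for_20 = path_rows_in_clique(20)
```

For 20 members the total exceeds 20 factorial, so no amount of indexing on the edge table makes the path-walking variant finish. `math.perm` needs Python 3.8 or later.
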