PostgreSQL: selecting records evenly across a categorical column, with a repeating sequence and pagination


Database: Postgres

I have a product (id, title, source, etc.) table containing nearly 500K records. An example of the data:

| Id | title    | source   |
|:---|---------:|:--------:|
| 1  | product1 | source1  |
| 2  | product2 | source1  |
| 3  | product3 | source1  |
| 4  | product4 | source1  |
| .  | ........ | source1  |
| .  | ........ | source2  |
| x  | productx | source2  |
|x+n |productX+n| sourceN  |

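There are 5 distinct source values, and the source value of each record is random.

I need paginated results in the following form: if I select 20 products, the result set should be evenly distributed by source and follow a repeating sequence. That means 2 products from each source through the last source, then the next 2 products from each source again. For example: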

| #  | title    | source   |
|:---|---------:|:--------:|
| 1  | product1 | source1  |
| 2  | product2 | source1  |
| 3  | product3 | source2  |
| 4  | product4 | source2  |
| 5  | product5 | source3  |
| 6  | product6 | source3  |
| 7  | product7 | source4  |
| 8  | product8 | source4  |
| 9  | product9 | source5  |
| 10 |product10 | source5  |
| 11 | ........ | source1  |
| 12 | ........ | source1  |
| 13 | ........ | source2  |
| 14 | ........ | source2  |
| .. | ........ | .......  |
| 20 | ........ | source5  |

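Given that limit and offset apply and that the number of sources can grow or shrink, what is an optimized PostgreSQL query for this scenario?

Edit

As suggested, the solution below works, but it performs poorly. Selecting just 20 records takes almost 6 seconds.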

select id, title, source
, (row_number() over(partition by source order by last_modified DESC) - 1) / 2 as ordinal 
   -- order here can be by created time, id, title, etc
from product p
order by ordinal, source
limit 20
offset 2;

EXPLAIN ANALYZE output for the above query on the real data:

Limit  (cost=147621.60..147621.65 rows=20 width=92) (actual time=5956.126..5956.138 rows=20 loops=1)
  ->  Sort  (cost=147621.60..148813.72 rows=476848 width=92) (actual time=5956.123..5956.128 rows=22 loops=1)
        Sort Key: (((row_number() OVER (?) - 1) / 2)), provider
        Sort Method: top-N heapsort  Memory: 28kB
        ->  WindowAgg  (cost=122683.80..134605.00 rows=476848 width=92) (actual time=5099.059..5772.821 rows=477731 loops=1)
              ->  Sort  (cost=122683.80..123875.92 rows=476848 width=84) (actual time=5098.873..5347.858 rows=477731 loops=1)
                    Sort Key: provider, last_modified DESC
                    Sort Method: external merge  Disk: 46328kB
                    ->  Seq Scan on product p  (cost=0.00..54889.48 rows=476848 width=84) (actual time=0.012..4360.000 rows=477731 loops=1)
Planning Time: 0.354 ms
Execution Time: 5961.670 ms

This can be achieved easily with a window function:

select id, title, source
, (row_number() over(partition by source order by id) - 1) / 2 as ordinal
   -- ordering here can be by created time, id, title, etc
from product p
order by ordinal, source
limit 10
offset 2;

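As you pointed out, this may or may not perform well, depending on your table size and any other filters you use. The best way to find out is to run EXPLAIN ANALYZE on the real data. If it does not perform well, you can also add the ordinal field to the table itself, provided its value/order is always the same. Unfortunately, you cannot create an index using a window function (at least not in PG12).

If you do not want to change the table itself, you can create a materialized view and query that view instead, so the computation is done only once: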

CREATE MATERIALIZED VIEW ordered_product AS
select id, title, source
, (row_number() over(partition by source order by id) - 1) / 2 as ordinal
from product;

After that, you can query the view just like a normal table:

select * from ordered_product order by ordinal, source limit 10 offset 20;

You can also create indexes on it if necessary. Note that the view does not update itself; to refresh it, run the following command:

REFRESH MATERIALIZED VIEW ordered_product;
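
As a rough sketch of the indexing mentioned above: a B-tree index whose column order matches the `order by ordinal, source` clause may let Postgres read the page straight from the index instead of sorting the whole view. The index names below are made up for illustration, and the unique index assumes `id` is unique in `product` (for example, its primary key), which the question does not state explicitly.

```sql
-- Hypothetical index name; column order mirrors "order by ordinal, source".
CREATE INDEX ordered_product_ordinal_source_idx
    ON ordered_product (ordinal, source);

-- Assumption: id is unique. A unique index on the view is required
-- before REFRESH ... CONCURRENTLY can be used.
CREATE UNIQUE INDEX ordered_product_id_idx
    ON ordered_product (id);

-- Refresh without blocking readers of the view while it is rebuilt.
REFRESH MATERIALIZED VIEW CONCURRENTLY ordered_product;

-- Paginated read; check with EXPLAIN ANALYZE whether the index is used.
select id, title, source
from ordered_product
order by ordinal, source
limit 20
offset 0;
```

Whether the planner actually picks the index depends on the data and settings, so it is worth confirming with EXPLAIN ANALYZE on the real table.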

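Thanks, @George S, this works. However, it does not perform well: selecting only 20 records takes almost 5 seconds. I don't understand this sentence: "you can also add the ordinal field to the table itself, provided its value/order is always the same." Can you give an example? I have edited the question with the EXPLAIN ANALYZE output for the real data. Please check.

Regarding "you can also add the ordinal field to the table itself, if it is always the same value/order": I mean that if you want faster queries, you can add a column to the product table that stores the "ordinal" information, so the database does not need to compute it on every query. You can also use a materialized view -- see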
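
The stored-column idea from the reply above could look roughly like the following. This is only a sketch under assumptions: the column name `ordinal`, its `bigint` type, and the index name are chosen here for illustration, and `id` is assumed to be unique so that each row can be matched to its computed value.

```sql
-- Hypothetical column holding the precomputed group number.
ALTER TABLE product ADD COLUMN ordinal bigint;

-- Compute the ordinal once with the same window expression as the answer
-- and write it back to each row (assumes id is unique).
UPDATE product p
SET ordinal = r.ordinal
FROM (
    select id
    , (row_number() over(partition by source order by id) - 1) / 2 as ordinal
    from product
) r
WHERE p.id = r.id;

-- A plain column can be indexed, unlike the window expression itself.
CREATE INDEX product_ordinal_source_idx
    ON product (ordinal, source);

-- The paginated query no longer needs the window function.
select id, title, source
from product
order by ordinal, source
limit 20
offset 0;
```

The stored values go stale when products are inserted or deleted, so the UPDATE would have to be rerun (or the value otherwise maintained) whenever the data changes. That is the same trade-off as refreshing the materialized view.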