Efficient querying/indexing for WHERE IN (...) and ORDER BY in Postgres


postgresql indexing


I have a posts table, where each post belongs to a classroom. I want to be able to query the latest posts across several classrooms, like this:

SELECT * FROM posts 
WHERE posts.classroom_id IN (6691, 6693, 6695, 6702) 
ORDER BY date desc, created_at desc 
LIMIT 30;

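Unfortunately, this makes Postgres gather and sort thousands of records: it has to fetch all posts for every classroom and sort them all together just to find the 30 most recent.

Here is the EXPLAIN ANALYZE output: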

  ->  Sort  (cost=67525.77..67571.26 rows=18194 width=489) (actual time=9373.376..9373.381 rows=30 loops=1)
        Sort Key: date DESC, created_at DESC
        Sort Method: top-N heapsort  Memory: 62kB
        ->  Bitmap Heap Scan on posts  (cost=350.74..66988.42 rows=18194 width=489) (actual time=41.360..9271.782 rows=42924 loops=1)
              Recheck Cond: (classroom_id = ANY ('{6691,6693,6695,6702}'::integer[]))
              Heap Blocks: exact=29456
              ->  Bitmap Index Scan on optimize_finding_photos_and_tagged_posts_by_classroom  (cost=0.00..346.19 rows=18194 width=0) (actual time=16.205..16.205 rows=42924 loops=1)
                    Index Cond: (classroom_id = ANY ('{6691,6693,6695,6702}'::integer[]))
Planning time: 0.216 ms
Execution time: 9390.323 ms

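Of the various index options, the planner chose one that starts with classroom_id, which makes sense (the subsequent fields in that index are not relevant). But it seems very inefficient to have to gather 42,924 rows and sort them.

It seems it could take a shortcut and retrieve only the 30 most recent posts for each classroom, then sort those. To facilitate this, I tried adding a new index on [classroom_id, date desc, created_at desc], but the planner chose not to use it. Is Postgres just not smart enough to use the shortcut I describe? Or am I overlooking something?

So, is there a better index or way of querying that makes this kind of lookup more efficient?

A side question: in the EXPLAIN ANALYZE output, why does the Sort node take so little time? I would expect the sort to be quite slow/expensive.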

Creating a test database:

CREATE TABLE posts( classroom_id INT NOT NULL, date FLOAT NOT NULL, foo TEXT );
INSERT INTO posts SELECT random()*100, random() FROM generate_series( 1,1500000 );
CREATE INDEX posts_cd ON posts( classroom_id, date );
CREATE INDEX posts_date ON posts( date );
VACUUM ANALYZE posts;

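Note that the "foo" column is there to prevent an index-only scan on posts. Such a scan would be very fast in this test setup, where the table would otherwise contain only the indexed columns (classroom_id, date), but it would be of no use to you, since you will also select the other columns.

If you have an index on date for some other purpose, such as displaying the latest posts across all classrooms, then you can use it here too: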

EXPLAIN ANALYZE SELECT * FROM posts WHERE posts.classroom_id IN (1,2,6)
ORDER BY date desc LIMIT 30;
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.29..55.67 rows=30 width=44) (actual time=0.040..0.983 rows=30 loops=1)
   ->  Index Scan Backward using posts_date on posts  (cost=0.29..5447.29 rows=2951 width=44) (actual time=0.039..0.978 rows=30 loops=1)
         Filter: (classroom_id = ANY ('{1,2,6}'::integer[]))
         Rows Removed by Filter: 916
 Planning time: 0.117 ms
 Execution time: 1.008 ms

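This is a bit risky, because the condition on classroom_id is not served by the index: the date index is scanned backward, so if many classrooms excluded by the WHERE condition have recent posts, it may have to skip a lot of index entries before it finds the requested rows. My test data is randomly distributed, but if your data is distributed differently, this query may perform differently.

Now, without the index on date: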

                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10922.61..10922.69 rows=30 width=44) (actual time=41.038..41.049 rows=30 loops=1)
   ->  Sort  (cost=10922.61..11028.44 rows=42331 width=44) (actual time=41.036..41.040 rows=30 loops=1)
         Sort Key: date DESC
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Bitmap Heap Scan on posts  (cost=981.34..9672.39 rows=42331 width=44) (actual time=10.275..33.056 rows=44902 loops=1)
               Recheck Cond: (classroom_id = ANY ('{1,2,6}'::integer[]))
               Heap Blocks: exact=8069
               ->  Bitmap Index Scan on posts_cd  (cost=0.00..970.76 rows=42331 width=0) (actual time=8.613..8.613 rows=44902 loops=1)
                     Index Cond: (classroom_id = ANY ('{1,2,6}'::integer[]))
 Planning time: 0.145 ms
 Execution time: 41.086 ms
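
A plan like the one above can be obtained on a test database without permanently losing the date index. The following is only a sketch of one way to do it, not necessarily how the numbers above were produced; it relies on PostgreSQL's transactional DDL and reuses the posts_date index from the test setup.

```sql
-- Sketch for a test database only: DROP INDEX takes an ACCESS EXCLUSIVE
-- lock on the table, which is held until the transaction ends.
BEGIN;

-- Hide the date index from the planner for the duration of the transaction.
DROP INDEX posts_date;

-- The planner now has to fall back to the (classroom_id, date) index.
EXPLAIN ANALYZE
SELECT * FROM posts
WHERE posts.classroom_id IN (1, 2, 6)
ORDER BY date DESC
LIMIT 30;

-- Undo the DROP INDEX; the index is back exactly as it was.
ROLLBACK;
```

Because the drop is rolled back, no index rebuild is needed afterwards. On a busy production table the lock would block other sessions, so this is best kept to experiments.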

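Note that I adjusted the number of rows in the table so the bitmap scan finds about the same number of rows as yours.

This is the same plan as yours, including the top-N heapsort, which is much faster than a full sort (and uses much less memory):

"A side question: in the EXPLAIN ANALYZE output, why does the Sort node take so little time?"

Basically, it keeps only the top N rows in the heapsort buffer, since the rest would be discarded anyway, so it does not have to sort all the rows. As rows are fetched, they are pushed into the heapsort buffer (or dropped right away if the LIMIT would discard them anyway). So the sort is not a separate step that runs after all the data has been gathered; it happens while the data is being gathered, which is why it finishes at about the same time as the data retrieval.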
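
The effect of the LIMIT on the Sort node can be observed directly by running the same query with and without it. This is a minimal sketch that assumes the test table from above with only the posts_cd index in place (with posts_date present, the planner may skip the sort entirely, as shown earlier).

```sql
-- With LIMIT: the Sort node only has to keep the best 30 rows,
-- so the plan is expected to report "Sort Method: top-N heapsort".
EXPLAIN ANALYZE
SELECT * FROM posts
WHERE classroom_id IN (1, 2, 6)
ORDER BY date DESC
LIMIT 30;

-- Without LIMIT: every matching row has to be fully ordered, so the
-- Sort node typically reports quicksort, or an external (on-disk)
-- sort if the rows do not fit in work_mem.
EXPLAIN ANALYZE
SELECT * FROM posts
WHERE classroom_id IN (1, 2, 6)
ORDER BY date DESC;
```

Comparing the "Memory:" figure and the actual time of the two Sort nodes shows how much work the bounded heap saves.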

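Now, my query is much faster than yours, even though both use the same plan. There are several possible reasons; for example, I'm running it on an SSD, which is fast. But I think the most likely explanation is that your posts table contains... posts... which means a lot of text data. So a large amount of data has to be fetched and then thrown away, keeping only 30 rows. To test this, I just did: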

UPDATE posts SET foo = repeat('x', 992);  -- 992 bytes of text
VACUUM ANALYZE posts;

...and the query is much slower, 360 ms, and it reports:

Heap Blocks: exact=41046
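
To see how much heap data a query like this touches on your own table, the size functions and the BUFFERS option of EXPLAIN can help. A small sketch, assuming the table is named posts as in the test setup:

```sql
-- Size of the table's main data, and of the table including
-- its indexes and TOAST data.
SELECT pg_size_pretty(pg_relation_size('posts'))       AS heap_size,
       pg_size_pretty(pg_total_relation_size('posts')) AS total_size;

-- BUFFERS adds "shared hit" / "shared read" block counts to each node,
-- showing how many 8 kB pages were served from cache versus read in.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM posts
WHERE classroom_id IN (1, 2, 6)
ORDER BY date DESC
LIMIT 30;
```

A high "read" count on the Bitmap Heap Scan node would be consistent with the explanation that most of the time goes into pulling wide rows from storage.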

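So the amount of heap data being fetched is probably your problem. To fix it, the query should not fetch large amounts of data only to discard it, which means we are going to use the primary key... which you must already have, but I forgot it, so here it is: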

ALTER TABLE posts ADD post_id SERIAL PRIMARY KEY;
VACUUM ANALYZE posts;
DROP INDEX posts_cd;
CREATE INDEX posts_cdi ON posts( classroom_id, date, post_id );

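I add the PK to the index and drop the previous index, because I want an index-only scan, to avoid fetching all that data from the table. An index-only scan touches much less data, since the index does not contain the actual posts. Of course, we only get the PKs, so we have to join back to the main table to get the posts, but that happens only after all the filtering is done, so it is only 30 rows.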

EXPLAIN ANALYZE SELECT p.* FROM posts p 
JOIN (SELECT post_id FROM posts WHERE posts.classroom_id IN (1,2,6)
ORDER BY date desc LIMIT 30) pids USING (post_id)
ORDER BY date desc LIMIT 30;

 Limit  (cost=3212.05..3212.12 rows=30 width=1012) (actual time=38.410..38.421 rows=30 loops=1)
   ->  Sort  (cost=3212.05..3212.12 rows=30 width=1012) (actual time=38.410..38.419 rows=30 loops=1)
         Sort Key: p.date DESC
         Sort Method: quicksort  Memory: 85kB
         ->  Nested Loop  (cost=2957.71..3211.31 rows=30 width=1012) (actual time=38.108..38.329 rows=30 loops=1)
               ->  Limit  (cost=2957.29..2957.36 rows=30 width=12) (actual time=38.092..38.105 rows=30 loops=1)
                     ->  Sort  (cost=2957.29..3067.84 rows=44223 width=12) (actual time=38.092..38.104 rows=30 loops=1)
                           Sort Key: posts.date DESC
                           Sort Method: top-N heapsort  Memory: 26kB
                           ->  Index Only Scan using posts_cdi on posts  (cost=0.43..1651.19 rows=44223 width=12) (actual time=0.023..22.186 rows=44902 loops=1)
                                 Index Cond: (classroom_id = ANY ('{1,2,6}'::integer[]))
                                 Heap Fetches: 0
               ->  Index Scan using posts_pkey on posts p  (cost=0.43..8.45 rows=1 width=1012) (actual time=0.006..0.006 rows=1 loops=30)
                     Index Cond: (post_id = posts.post_id)
 Planning time: 0.305 ms
 Execution time: 38.468 ms

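Well. That's much faster now (38 ms, versus the 360 ms measured above with the wide rows). This trick is very useful: when a table contains a lot of data, or even just a lot of columns, that has to be dragged through the query engine, filtered, and mostly discarded, it is sometimes faster to filter and sort on only the few small columns actually needed, and then fetch the rest of the data only for the rows that remain after filtering. Sometimes it is even worth splitting the table in two, with the columns used for filtering and sorting in one table and everything else in the other.

To go even faster, we can make the query ugly: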
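
The "Heap Fetches: 0" line in the plan above depends on the table's visibility map being up to date, which is what VACUUM maintains. The sketch below shows one way to check this. The index name posts_cd_inc is hypothetical, and the INCLUDE variant assumes PostgreSQL 11 or later.

```sql
-- Refresh the visibility map so index-only scans can skip the heap.
VACUUM ANALYZE posts;

-- relallvisible close to relpages means most pages are marked
-- all-visible, so "Heap Fetches" should stay near zero.
SELECT relpages, relallvisible
FROM pg_class
WHERE relname = 'posts';

-- Alternative index shape (assumes PostgreSQL 11+; name is hypothetical):
-- post_id is stored as a non-key payload column instead of a third key.
CREATE INDEX posts_cd_inc ON posts (classroom_id, date) INCLUDE (post_id);
```

On a table with frequent updates, Heap Fetches will creep up between vacuums, and the index-only scan gradually loses part of its advantage.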

SELECT p.* FROM posts p
    JOIN (
      SELECT * FROM (SELECT post_id, date FROM posts WHERE posts.classroom_id=1 ORDER BY date desc LIMIT 30) a
      UNION ALL
      SELECT * FROM (SELECT post_id, date FROM posts WHERE posts.classroom_id=2 ORDER BY date desc LIMIT 30) b
      UNION ALL
      SELECT * FROM (SELECT post_id, date FROM posts WHERE posts.classroom_id=3 ORDER BY date desc LIMIT 30) c
      ORDER BY date desc LIMIT 30
    ) q USING (post_id)
    ORDER BY date desc LIMIT 30;

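This exploits the fact that when the WHERE condition has only a single classroom_id, Postgres uses a backward index scan directly on (classroom_id, date). And since I added post_id to the index, it does not even need to touch the table. Because the three SELECTs in the union have the same sort order, it combines them with a Merge Append, which means it does not even need to sort or fetch the rows that get cut off by the outer LIMIT 30.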

 Limit  (cost=257.97..258.05 rows=30 width=1012) (actual time=0.357..0.367 rows=30 loops=1)
   ->  Sort  (cost=257.97..258.05 rows=30 width=1012) (actual time=0.356..0.364 rows=30 loops=1)
         Sort Key: p.date DESC
         Sort Method: quicksort  Memory: 85kB
         ->  Nested Loop  (cost=1.73..257.23 rows=30 width=1012) (actual time=0.063..0.319 rows=30 loops=1)
               ->  Limit  (cost=1.31..3.28 rows=30 width=12) (actual time=0.050..0.085 rows=30 loops=1)
                     ->  Merge Append  (cost=1.31..7.24 rows=90 width=12) (actual time=0.049..0.081 rows=30 loops=1)
                           Sort Key: posts.date DESC
                           ->  Limit  (cost=0.43..1.56 rows=30 width=12) (actual time=0.024..0.032 rows=12 loops=1)
                                 ->  Index Only Scan Backward using posts_cdi on posts  (cost=0.43..531.81 rows=14136 width=12) (actual time=0.024..0.029 rows=12 loops=1)
                                       Index Cond: (classroom_id = 1)
                                       Heap Fetches: 0
                           ->  Limit  (cost=0.43..1.55 rows=30 width=12) (actual time=0.018..0.024 rows=9 loops=1)
                                 ->  Index Only Scan Backward using posts_cdi on posts posts_1  (cost=0.43..599.55 rows=15950 width=12) (actual time=0.017..0.023 rows=9 loops=1)
                                       Index Cond: (classroom_id = 2)
                                       Heap Fetches: 0
                           ->  Limit  (cost=0.43..1.56 rows=30 width=12) (actual time=0.006..0.014 rows=11 loops=1)
                                 ->  Index Only Scan Backward using posts_cdi on posts posts_2  (cost=0.43..531.81 rows=14136 width=12) (actual time=0.006..0.014 rows=11 loops=1)
                                       Index Cond: (classroom_id = 3)
                                       Heap Fetches: 0
               ->  Index Scan using posts_pkey on posts p  (cost=0.43..8.45 rows=1 width=1012) (actual time=0.006..0.007 rows=1 loops=30)
                     Index Cond: (post_id = posts.post_id)
 Planning time: 0.445 ms
 Execution time: 0.432 ms

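The resulting speedup is pretty ridiculous. I think this should work.

"To facilitate this, I tried adding a new index on [classroom_id, date desc, created_at desc], but the planner chose not to use it. Is Postgres just not smart enough to use the shortcut I describe?"

It is not smart enough. You can write it out explicitly to get the execution you envision. It is ugly, but it should be efficient: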
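
When the list of classrooms varies, writing one UNION ALL branch per id gets tedious. A LATERAL join is a different technique that expresses the same "top 30 per classroom, then top 30 overall" idea for an arbitrary list. This is only a sketch, assuming PostgreSQL 9.3 or later and the posts_cdi index from above; the aliases c and top are illustrative.

```sql
SELECT p.*
FROM unnest(ARRAY[1, 2, 3]) AS c(classroom_id)
CROSS JOIN LATERAL (
    -- Runs once per classroom id; with an index on
    -- (classroom_id, date, post_id) this can be a backward
    -- index-only scan that stops after 30 entries.
    SELECT post_id, date
    FROM posts
    WHERE posts.classroom_id = c.classroom_id
    ORDER BY date DESC
    LIMIT 30
) AS top
-- Fetch the wide rows only for the candidate post ids.
JOIN posts p USING (post_id)
ORDER BY p.date DESC
LIMIT 30;
```

Unlike the Merge Append plan, this form reads up to 30 index entries for every classroom and joins all of them back to the table before the final sort, so it does somewhat more work. In exchange, the id list can be passed as a single array parameter.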

(SELECT * FROM posts WHERE classroom_id = 6691 ORDER BY date desc, created_at desc LIMIT 30)
union all
(SELECT * FROM posts WHERE classroom_id = 6693 ORDER BY date desc, created_at desc LIMIT 30)
union all
(SELECT * FROM posts WHERE classroom_id = 6695 ORDER BY date desc, created_at desc LIMIT 30)
union all
(SELECT * FROM posts WHERE classroom_id = 6702 ORDER BY date desc, created_at desc LIMIT 30)
order by date desc, created_at desc LIMIT 30;
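
For each branch of this query to read only 30 rows, it needs an index that matches both the equality on classroom_id and the sort order. The statement below is a sketch of the index described in the question; the index name is hypothetical.

```sql
-- Matches WHERE classroom_id = ? ORDER BY date DESC, created_at DESC.
-- Because both sort columns use the same direction, a plain
-- (classroom_id, date, created_at) index scanned backward serves
-- this ordering equally well.
CREATE INDEX posts_classroom_date_created
    ON posts (classroom_id, date DESC, created_at DESC);
```

Running EXPLAIN ANALYZE on a single branch should then show an index scan with a Limit on top and no Sort node; if a Sort still appears, the index is not being used for the ordering.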

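"A side question: in the EXPLAIN ANALYZE output, why does the Sort node take so little time? I would expect the sort to be quite slow/expensive."

CPUs are very fast, and 40,000 rows is not many. Your storage, however, is not as fast as your CPU, and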