
PostgreSQL chooses a hash join instead of an index scan


I am running PostgreSQL 12.6. I have a table delivery_info with 14007206 rows and a table notification_user with 3550313 rows (irrelevant parts of the DDL removed for brevity):

create table if not exists notification_user
(
    id bigserial not null
        constraint notification_user__id__seq
            primary key,
    ...
);

The query joins the two tables, with a WHERE clause on delivery_info:

SELECT *
FROM delivery_info AS d
INNER JOIN notification_user AS n ON d.user_notification_id = n.id
WHERE d.status = 1 AND d.acknowledged = false AND d.status_change_date < '2021-04-16T13:48:00.2234239Z';

Gather  (cost=1211782.75..1987611.05 rows=2293631 width=122) (actual time=49921.908..123141.788 rows=2790 loops=1)
  Workers Planned: 4
  Workers Launched: 4
  Buffers: shared hit=24996 read=412218
  I/O Timings: read=317223.835
  ->  Parallel Hash Join  (cost=211782.75..758247.95 rows=573408 width=122) (actual time=49923.633..123072.227 rows=558 loops=5)
        Hash Cond: (n.id = d.user_notification_id)
        Buffers: shared hit=24993 read=412218
        I/O Timings: read=317223.835
        ->  Parallel Seq Scan on notification_user n  (cost=0.00..511671.22 rows=8896122 width=75) (actual time=9.874..90448.053 rows=7100603 loops=5)
              Buffers: shared hit=10492 read=412218
              I/O Timings: read=317223.835
        ->  Parallel Hash  (cost=204615.15..204615.15 rows=573408 width=47) (actual time=210.255..210.262 rows=558 loops=5)
              Buckets: 4194304  Batches: 1  Memory Usage: 33056kB
              Buffers: shared hit=14386
              ->  Parallel Bitmap Heap Scan on delivery_info d  (cost=43803.04..204615.15 rows=573408 width=47) (actual time=187.358..188.670 rows=558 loops=5)
                    Recheck Cond: ((status_change_date < '2021-04-16 13:48:00.223424'::timestamp without time zone) AND (status = 1))
                    Filter: (NOT acknowledged)
                    Heap Blocks: exact=87
                    Buffers: shared hit=14386
                    ->  Bitmap Index Scan on delivery_info__status_change_date_acknowledged__index  (cost=0.00..43229.63 rows=2293631 width=0) (actual time=182.445..182.447 rows=2790 loops=1)
                          Index Cond: ((status_change_date < '2021-04-16 13:48:00.223424'::timestamp without time zone) AND (acknowledged = false))
                          Buffers: shared hit=14259
Planning Time: 57.240 ms
Execution Time: 123147.866 ms

The same query, analyzed with SET enable_seqscan = off:
Gather  (cost=1043803.60..2525242.24 rows=2293631 width=122) (actual time=156.124..186.178 rows=2790 loops=1)
  Workers Planned: 4
  Workers Launched: 4
  Buffers: shared hit=28349
  ->  Nested Loop  (cost=43803.60..1295879.14 rows=573408 width=122) (actual time=124.191..137.654 rows=558 loops=5)
        Buffers: shared hit=28349
        ->  Parallel Bitmap Heap Scan on delivery_info d  (cost=43803.04..204615.15 rows=573408 width=47) (actual time=124.141..125.410 rows=558 loops=5)
              Recheck Cond: ((status_change_date < '2021-04-16 13:48:00.223424'::timestamp without time zone) AND (status = 1))
              Filter: (NOT acknowledged)
              Heap Blocks: exact=57
              Buffers: shared hit=14386
              ->  Bitmap Index Scan on delivery_info__status_change_date_acknowledged__index  (cost=0.00..43229.63 rows=2293631 width=0) (actual time=155.243..155.245 rows=2790 loops=1)
                    Index Cond: ((status_change_date < '2021-04-16 13:48:00.223424'::timestamp without time zone) AND (acknowledged = false))
                    Buffers: shared hit=14259
        ->  Index Scan using notification_user__id__seq on notification_user n  (cost=0.56..1.90 rows=1 width=75) (actual time=0.007..0.007 rows=1 loops=2790)
              Index Cond: (id = d.user_notification_id)
              Buffers: shared hit=13963
Planning Time: 1.061 ms
Execution Time: 190.706 ms
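
For reference, a comparison like the one above can be run in a single session roughly as follows. This is a minimal sketch: the EXPLAIN options are an assumption inferred from the Buffers lines in the plans, since the post does not show the exact command that was used.

```sql
-- Discourage sequential scans for the current session only (diagnostic use).
SET enable_seqscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM delivery_info AS d
INNER JOIN notification_user AS n ON d.user_notification_id = n.id
WHERE d.status = 1
  AND d.acknowledged = false
  AND d.status_change_date < '2021-04-16T13:48:00.2234239Z';

-- Restore the default so later queries are planned normally.
RESET enable_seqscan;
```

Because the setting is changed with SET rather than in postgresql.conf, it affects only the current session and is undone by RESET.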

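Both queries (with and without a lower bound on delivery_info.status_change_date) return 2790 results.
Clearly, the problem is that the query planner assumes the clause on status_change_date is non-selective, even though relatively few rows satisfy all clauses in the query. How can I optimize this behaviour? I would prefer not to set a lower bound on status_change_date.
I ran VACUUM ANALYZE on delivery_info, and I also checked seq_page_cost and random_page_cost (both are set to 1). I tried increasing STATISTICS on status_change_date and increasing default_statistics_target before running ANALYZE, but neither helped.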
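
For readers who want to repeat the statistics adjustments mentioned above, they are usually written along these lines. This is only a sketch: the target value 1000 is an assumption, as the post does not say which values were tried.

```sql
-- Raise the per-column statistics target (1000 is an assumed value; the default is 100).
ALTER TABLE delivery_info
    ALTER COLUMN status_change_date SET STATISTICS 1000;

-- Alternatively, raise the default used by columns without their own target
-- (session-level here; it can also be set in postgresql.conf).
SET default_statistics_target = 1000;

-- Re-gather statistics so the planner sees the larger sample.
ANALYZE delivery_info;
```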
Edit:
As suggested by @jjane, I have added actual and estimated counts for different combinations of the WHERE clause expressions:
clause                                                                                              actual      estimated
d.status = 1 AND d.acknowledged = false AND d.status_change_date < '2021-04-16T13:48:00.2234239Z'   2790        2295101
d.status = 1 AND d.acknowledged = false AND d.status_change_date > '2021-04-16T13:48:00.2234239Z'   119         571
d.status = 1 AND d.acknowledged != false AND d.status_change_date < '2021-04-16T13:48:00.2234239Z'  2891204     596341
d.status = 1 AND d.acknowledged != false AND d.status_change_date > '2021-04-16T13:48:00.2234239Z'  0           148
d.status != 1 AND d.acknowledged = false AND d.status_change_date < '2021-04-16T13:48:00.2234239Z'  11113008    8820447
d.status != 1 AND d.acknowledged = false AND d.status_change_date > '2021-04-16T13:48:00.2234239Z'  3           2193
d.status != 1 AND d.acknowledged != false AND d.status_change_date < '2021-04-16T13:48:00.2234239Z' 82          2291834
d.status != 1 AND d.acknowledged != false AND d.status_change_date > '2021-04-16T13:48:00.2234239Z' 0           570
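
One way such a pair of numbers can be obtained for a single row of the table is sketched below. The post does not show how the counts were actually collected, so this is an assumption: the topmost plan node reports the planner's estimate as rows= inside the cost parentheses and the true count as rows= inside the actual parentheses.

```sql
-- TIMING OFF skips per-node timing, which is not needed for row counts.
EXPLAIN (ANALYZE, TIMING OFF)
SELECT *
FROM delivery_info AS d
WHERE d.status = 1
  AND d.acknowledged = false
  AND d.status_change_date < '2021-04-16T13:48:00.2234239Z';
```

Repeating this with each combination of = / != and < / > gives the remaining rows.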


It looks like the estimated numbers are quite far off. I have been analyzing this for over a month. What am I missing?

The first thing that caught my attention is this index:
create index if not exists delivery_info__status_change_date_acknowledged__index
    on delivery_info (status asc, status_change_date desc, acknowledged asc)
    where (status = 1);

There is no point in adding "status asc" when every indexed row has the same value because of "where (status = 1)".
I would merge the two indexes and try this one first:
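-- Note: IF NOT EXISTS skips creation when an index with this name already exists,
-- so drop (or rename) the old index first, or pick a new name here.
create index if not exists delivery_info__status_change_date_acknowledged__index
    on delivery_info (status_change_date desc, user_notification_id, acknowledged)
    where (status = 1);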

Another thing that might help is creating some additional indexes.

Adding a column to the index and reordering its columns should help.
The WHERE clause of your query applies these filters to the delivery_info table:
WHERE d.status = 1
  AND d.acknowledged = false
  AND d.status_change_date < timeconstant;

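Why? The query can random-access the index to the first eligible entry and then satisfy everything it needs from that table by scanning the index. As an added bonus, the values you use for the fk will come out in ascending order, matching the pk on the table you join to. That should, hopefully, allow a merge join in place of the hash join.
Your acknowledged column should come before the status_change_date column in the index, because you filter the former for equality and the latter for a range.
Pro tip: SELECT * can be harmful to performance in these situations, because it forces the query to retrieve columns you may not need. List the columns you need in the SELECT clause.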
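
The column ordering described above could look roughly like the following. This is a sketch under stated assumptions: the index name is hypothetical, and keeping the partial predicate on status and appending user_notification_id is one reading of the advice, not a definition given in the thread.

```sql
-- Hypothetical index name. Column order follows the reasoning above:
-- the equality column first, then the range column, then the join key.
CREATE INDEX IF NOT EXISTS delivery_info__ack_date_user__index
    ON delivery_info (acknowledged, status_change_date DESC, user_notification_id)
    WHERE (status = 1);

-- Refresh statistics after creating the index.
ANALYZE delivery_info;
```

Whether the planner then prefers a merge join or a nested loop still depends on its row estimates, so it is worth re-running EXPLAIN (ANALYZE, BUFFERS) afterwards.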
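
You use SELECT * in your query, yet you give us only part of the table structure.
An index cannot fully cover a query (that is, allow an index-only scan) unless every column the query uses is part of the index definition.
So the question is: do you really need to return all columns? If so, the index would have to contain every column of the table, and in that case you should use the new INCLUDE clause (originally invented for Microsoft SQL Server). Otherwise, restrict the column list in the SELECT clause to the minimal subset of columns you need.
By the way, please always provide the full DDL code.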
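
As an illustration of the INCLUDE approach, here is a minimal sketch. The index name and the selected column list are hypothetical, since the full DDL of delivery_info is not shown in the question; INCLUDE is available from PostgreSQL 11 onward.

```sql
-- Hypothetical covering index: key columns are used for filtering,
-- INCLUDE columns are stored only so the heap need not be visited.
CREATE INDEX IF NOT EXISTS delivery_info__covering__index
    ON delivery_info (acknowledged, status_change_date DESC)
    INCLUDE (user_notification_id)
    WHERE (status = 1);

-- Select only the columns that are needed instead of SELECT *.
SELECT d.user_notification_id, d.status_change_date, n.id
FROM delivery_info AS d
INNER JOIN notification_user AS n ON d.user_notification_id = n.id
WHERE d.status = 1
  AND d.acknowledged = false
  AND d.status_change_date < '2021-04-16T13:48:00.2234239Z';
```

An index-only scan on delivery_info may then become possible, although it also depends on how much of the table is marked all-visible by VACUUM.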