Seemingly random delays in PostgreSQL queries
This is a follow-up to a question I posted a while ago. I have the following code:
SET work_mem = '16MB';
SELECT s.start_date, s.end_date, s.resources, s.activity_index, r.resource_id, sa.usedresourceset
FROM rm_o_resource_usage_instance_splits_new s
INNER JOIN rm_o_resource_usage r ON s.usage_id = r.id
INNER JOIN scheduledactivities sa ON s.activity_index = sa.activity_index AND r.schedule_id = sa.solution_id and s.solution = sa.solution_id
WHERE r.schedule_id = 10
ORDER BY r.resource_id, s.start_date
When I run EXPLAIN (ANALYZE, BUFFERS), I get the following:
Sort (cost=3724.02..3724.29 rows=105 width=89) (actual time=245.802..247.573 rows=22302 loops=1)
Sort Key: r.resource_id, s.start_date
Sort Method: quicksort Memory: 6692kB
Buffers: shared hit=198702 read=5993 written=612
-> Nested Loop (cost=703.76..3720.50 rows=105 width=89) (actual time=1.898..164.741 rows=22302 loops=1)
Buffers: shared hit=198702 read=5993 written=612
-> Hash Join (cost=703.34..3558.54 rows=105 width=101) (actual time=1.815..11.259 rows=22302 loops=1)
Hash Cond: (s.usage_id = r.id)
Buffers: shared hit=3 read=397 written=2
-> Bitmap Heap Scan on rm_o_resource_usage_instance_splits_new s (cost=690.61..3486.58 rows=22477 width=69) (actual time=1.782..5.820 rows=22302 loops=1)
Recheck Cond: (solution = 10)
Heap Blocks: exact=319
Buffers: shared hit=2 read=396 written=2
-> Bitmap Index Scan on rm_o_resource_usage_instance_splits_new_solution_idx (cost=0.00..685.00 rows=22477 width=0) (actual time=1.609..1.609 rows=22302 loops=1)
Index Cond: (solution = 10)
Buffers: shared hit=2 read=77
-> Hash (cost=12.66..12.66 rows=5 width=48) (actual time=0.023..0.023 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
Buffers: shared hit=1 read=1
-> Bitmap Heap Scan on rm_o_resource_usage r (cost=4.19..12.66 rows=5 width=48) (actual time=0.020..0.020 rows=1 loops=1)
Recheck Cond: (schedule_id = 10)
Heap Blocks: exact=1
Buffers: shared hit=1 read=1
-> Bitmap Index Scan on rm_o_resource_usage_sched (cost=0.00..4.19 rows=5 width=0) (actual time=0.017..0.017 rows=1 loops=1)
Index Cond: (schedule_id = 10)
Buffers: shared read=1
-> Index Scan using scheduledactivities_activity_index_idx on scheduledactivities sa (cost=0.42..1.53 rows=1 width=16) (actual time=0.004..0.007 rows=1 loops=22302)
Index Cond: (activity_index = s.activity_index)
Filter: (solution_id = 10)
Rows Removed by Filter: 5
Buffers: shared hit=198699 read=5596 written=610
Planning time: 7.070 ms
Execution time: 248.691 ms
Every time I run EXPLAIN I get roughly the same result. The execution time is always somewhere between 170 ms and 250 ms, which is perfectly fine for me. However, when this query is run from a C++ project (using PQexec(conn, query), where conn is a dedicated connection and query is the query above), the time taken seems to vary enormously. Generally the query is very quick and you don't notice any delay. The problem is that sometimes this query takes 2 to 3 minutes to complete.

If I open pgAdmin and look at the server activity for the database, there are around 30 connections, mostly 'idle'. The connection for the above query is marked 'active', and it stays 'active' for minutes at a time.

I don't understand why the same query randomly takes minutes to complete when the data in the database hasn't changed. I tried increasing work_mem, but it made no difference (nor did I really expect it to). Any help or suggestions would be much appreciated.
There aren't any more specific tags for this, but I'm currently on Postgres 10.11; it has been an issue on other 10.x versions as well. The system is a quad-core Xeon @ 3.4 GHz with an SSD and 24 GB of RAM.
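While the query sits in the 'active' state, a separate connection can ask the server what that backend is actually doing. A minimal diagnostic sketch (pg_stat_activity, including the wait_event columns, is available in Postgres 10):

```sql
-- Run from a separate connection while the slow query is in flight.
-- state shows whether the backend is executing or waiting on the client;
-- wait_event_type/wait_event show lock waits, I/O waits, etc. (NULL = on CPU).
SELECT pid, state, wait_event_type, wait_event,
       now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start;
```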
Following jjanes's suggestion, I turned on auto_explain. The eventual output is below:
duration: 128057.373 ms
plan:
Query Text: SET work_mem = '32MB';SELECT s.start_date, s.end_date, s.resources, s.activity_index, r.resource_id, sa.usedresourceset FROM rm_o_resource_usage_instance_splits_new s INNER JOIN rm_o_resource_usage r ON s.usage_id = r.id INNER JOIN scheduledactivities sa ON s.activity_index = sa.activity_index AND r.schedule_id = sa.solution_id and s.solution = sa.solution_id WHERE r.schedule_id = 12642 ORDER BY r.resource_id, s.start_date
Sort (cost=14.36..14.37 rows=1 width=98) (actual time=128042.083..128043.287 rows=21899 loops=1)
Output: s.start_date, s.end_date, s.resources, s.activity_index, r.resource_id, sa.usedresourceset
Sort Key: r.resource_id, s.start_date
Sort Method: quicksort Memory: 6585kB
Buffers: shared hit=21198435 read=388 dirtied=119
-> Nested Loop (cost=0.85..14.35 rows=1 width=98) (actual time=4.995..127958.935 rows=21899 loops=1)
Output: s.start_date, s.end_date, s.resources, s.activity_index, r.resource_id, sa.usedresourceset
Join Filter: (s.activity_index = sa.activity_index)
Rows Removed by Join Filter: 705476285
Buffers: shared hit=21198435 read=388 dirtied=119
-> Nested Loop (cost=0.42..9.74 rows=1 width=110) (actual time=0.091..227.705 rows=21899 loops=1)
Output: s.start_date, s.end_date, s.resources, s.activity_index, s.solution, r.resource_id, r.schedule_id
Inner Unique: true
Join Filter: (s.usage_id = r.id)
Buffers: shared hit=22102 read=388 dirtied=119
-> Index Scan using rm_o_resource_usage_instance_splits_new_solution_idx on public.rm_o_resource_usage_instance_splits_new s (cost=0.42..8.44 rows=1 width=69) (actual time=0.082..17.418 rows=21899 loops=1)
Output: s.start_time, s.end_time, s.resources, s.activity_index, s.usage_id, s.start_date, s.end_date, s.solution
Index Cond: (s.solution = 12642)
Buffers: shared hit=203 read=388 dirtied=119
-> Seq Scan on public.rm_o_resource_usage r (cost=0.00..1.29 rows=1 width=57) (actual time=0.002..0.002 rows=1 loops=21899)
Output: r.id, r.schedule_id, r.resource_id
Filter: (r.schedule_id = 12642)
Rows Removed by Filter: 26
Buffers: shared hit=21899
-> Index Scan using scheduled_activities_idx on public.scheduledactivities sa (cost=0.42..4.60 rows=1 width=16) (actual time=0.006..4.612 rows=32216 loops=21899)
Output: sa.usedresourceset, sa.activity_index, sa.solution_id
Index Cond: (sa.solution_id = 12642)
Buffers: shared hit=21176333
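For reference, output like the above can be captured with auto_explain settings along these lines; this is a sketch only, and the exact thresholds here are assumptions rather than the configuration actually used:

```sql
-- Either LOAD in a (superuser) session, or add 'auto_explain' to
-- session_preload_libraries / shared_preload_libraries in postgresql.conf.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '30s'; -- only log queries slower than this
SET auto_explain.log_analyze = on;         -- include actual times and row counts
SET auto_explain.log_buffers = on;         -- include the Buffers: lines
SET auto_explain.log_verbose = on;         -- include the Output: lines
```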
EDIT: The full definition of the table is as follows:
CREATE TABLE public.rm_o_resource_usage_instance_splits_new
(
start_time integer NOT NULL,
end_time integer NOT NULL,
resources jsonb NOT NULL,
activity_index integer NOT NULL,
usage_id bigint NOT NULL,
start_date text COLLATE pg_catalog."default" NOT NULL,
end_date text COLLATE pg_catalog."default" NOT NULL,
solution bigint NOT NULL,
CONSTRAINT rm_o_resource_usage_instance_splits_new_pkey PRIMARY KEY (start_time, activity_index, usage_id),
CONSTRAINT rm_o_resource_usage_instance_splits_new_solution_fkey FOREIGN KEY (solution)
REFERENCES public.rm_o_schedule_stats (id) MATCH SIMPLE
ON UPDATE CASCADE
ON DELETE CASCADE,
CONSTRAINT rm_o_resource_usage_instance_splits_new_usage_id_fkey FOREIGN KEY (usage_id)
REFERENCES public.rm_o_resource_usage (id) MATCH SIMPLE
ON UPDATE CASCADE
ON DELETE CASCADE
)
WITH (
OIDS = FALSE
)
TABLESPACE pg_default;
CREATE INDEX rm_o_resource_usage_instance_splits_new_activity_idx
ON public.rm_o_resource_usage_instance_splits_new USING btree
(activity_index ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX rm_o_resource_usage_instance_splits_new_solution_idx
ON public.rm_o_resource_usage_instance_splits_new USING btree
(solution ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX rm_o_resource_usage_instance_splits_new_usage_idx
ON public.rm_o_resource_usage_instance_splits_new USING btree
(usage_id ASC NULLS LAST)
TABLESPACE pg_default;
EDIT: additional auto_explain output after adding an index on scheduledactivities (solution_id, activity_index)
The easiest way to reproduce the problem is to add more values to the three tables. I didn't delete anything, just inserted a few thousand more rows.

The SQL of the fast plan uses WHERE r.schedule_id = 10 and returns about 22,000 rows (estimated: 105).

The SQL of the slow plan uses WHERE r.schedule_id = 12642 and returns about 21,000 rows (estimated: just 1).

The slow plan uses nested loops instead of hash joins, probably because of the bad join estimate: the estimated row count is 1, while the actual count is 21,899. For example, in this step:

Nested Loop (cost=0.42..9.74 rows=1 width=110) (actual time=0.091..227.705 rows=21899 loops=1)
  -> Index Scan using .. s (cost=0.42..8.44 rows=1 width=69) (actual time=0.082..17.418 rows=21899 loops=1)
       Index Cond: (s.solution = 12642)

If the data has not changed, there may be a statistics problem (skewed data) in some of the columns.
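One way to check whether the planner's statistics for the filtering column are stale, and to refresh them by hand, is along these lines (a sketch; the pg_stats view shows what the planner currently believes about the column):

```sql
-- What does the planner think the 'solution' column looks like?
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'rm_o_resource_usage_instance_splits_new'
  AND attname = 'solution';

-- Rebuild the statistics (and clean up dead tuples) by hand.
VACUUM ANALYZE rm_o_resource_usage_instance_splits_new;
VACUUM ANALYZE scheduledactivities;
```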
The planner thinks it will find 1 row, and instead finds 21,899. That error will obviously lead to bad plans. An equality condition should get a pretty accurate estimate, so I'd say the statistics on your table are grossly wrong. It could be that the autovac launcher is mistuned, so it doesn't run often enough, or it could be that the part of the data you're selecting over changes very rapidly (were the 21,899 rows with s.solution = 12642 inserted immediately before you ran the query?), so the statistics can't be kept accurate enough.
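If rows for a new solution_id are bulk-inserted right before the query runs, one workaround is to ANALYZE immediately after the load, or to make autoanalyze fire sooner for the affected table. A sketch; the thresholds below are assumptions to illustrate the knobs, not recommended values:

```sql
-- Run straight after the bulk insert, before the big query:
ANALYZE rm_o_resource_usage_instance_splits_new;

-- Or let autoanalyze react to smaller changes on this table:
ALTER TABLE rm_o_resource_usage_instance_splits_new
    SET (autovacuum_analyze_scale_factor = 0.01,  -- default 0.1 (10% of table)
         autovacuum_analyze_threshold    = 500);
```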
-> Nested Loop ...
Join Filter: (s.activity_index = sa.activity_index)
Rows Removed by Join Filter: 705476285
-> ...
-> Index Scan using scheduled_activities_idx on public.scheduledactivities sa (cost=0.42..4.60 rows=1 width=16) (actual time=0.006..4.612 rows=32216 loops=21899)
Output: sa.usedresourceset, sa.activity_index, sa.solution_id
Index Cond: (sa.solution_id = 12642)
If you can't get it to use a hash join, you can at least make the nested loop less harmful by building an index on scheduledactivities (solution_id, activity_index). That way the activity_index criterion can become part of the Index Condition rather than the Join Filter. You would probably then drop the index on just solution_id, as there is little point in maintaining both.
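The suggested index would look something like the following (a sketch; CONCURRENTLY avoids blocking writes while the index builds, and the index name chosen here is illustrative):

```sql
CREATE INDEX CONCURRENTLY scheduledactivities_solution_activity_idx
    ON scheduledactivities (solution_id, activity_index);

-- With the composite index in place, a plain index on solution_id alone
-- (scheduled_activities_idx in the plan above, assuming that is what it
-- covers) becomes largely redundant:
DROP INDEX CONCURRENTLY scheduled_activities_idx;
```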
I would set auto_explain's log_min_duration to around 30 seconds, so that you can capture the execution plan of the query while it is being slow. That removes a lot of the guesswork about what might be going wrong when it is slow; you can see directly what it is doing.

@jjanes, I've turned that on and will see what happens. I've tried twenty-odd times since turning it on, and every time has been fine so far! So I'll keep trying to reproduce it, and will update the question if/when I can.

@jjanes, got it to happen again, so I've updated the question.

Join Filter: (s.activity_index = sa.activity_index) with Rows Removed by Join Filter: 705476285 is a huge number of discarded results. This should be an index join. Please add the definition of the table (including PK/FK and secondary indexes) to the question. Also run VACUUM ANALYZE on the tables.

@wildplasser, I've updated the question to include the table definition.