Problem statement: aggregating over a 50M+ row table in PostgreSQL

I have an "event_statistics" table, defined as follows:
CREATE TABLE public.event_statistics (
id int4 NOT NULL DEFAULT nextval('event_statistics_id_seq'::regclass),
client_id int4 NULL,
session_id int4 NULL,
action_name text NULL,
value text NULL,
product_id int8 NULL,
product_options jsonb NOT NULL DEFAULT '{}'::jsonb,
url text NULL,
url_options jsonb NOT NULL DEFAULT '{}'::jsonb,
visit int4 NULL DEFAULT 0,
date_update timestamptz NULL,
CONSTRAINT event_statistics_pkey PRIMARY KEY (id),
CONSTRAINT event_statistics_client_id_session_id_sessions_client_id_id_for
FOREIGN KEY
(client_id,session_id) REFERENCES sessions (client_id,id) ON DELETE CASCADE ON UPDATE CASCADE
)
WITH (
OIDS=FALSE
) ;
CREATE INDEX regdate ON public.event_statistics (date_update timestamptz_ops) ;
What I need is, for a given "date_update" range, the count of events of each "action_name" type in the "event_statistics" table, plus the total count of all events for a specific client.

The goal is to show, on our website's dashboard, statistics over all relevant events for each client, with a selectable reporting date; depending on the interval, the step used in the chart should differ:

- current date: counts per hour
- more than 1 day and …

First step: pre-aggregate in a subquery:
Next step: put the VALUES into a CTE and reference it in the aggregating subquery. (The gain depends on the number of action_names that can be skipped.)
Update: using a physical (temp) table will produce better estimates.
Update #3 (sorry, I create the indexes on the base table here; you will need to edit that. I also dropped a cast on the timestamp column.)
The pain appears to be the sort on the low-cardinality text column action_name. (Personally, I would prefer a numeric action_id here.) Also, neither the (function) calendar table nor the (VALUES) action_name pseudo-table offers the optimizer anything to work with (indexes, statistics), so I materialized both into (TEMP) tables.

Thanks for the hints. Yes, the problem seems to be the external disk sort plus the slow read of all of the client's data. For some reason, though, I cannot eliminate the need for the sort, even using a covering index as written at the end of the post. Only when I raise "work_mem" sufficiently does such an index get much faster, thanks to the in-memory sort, but that is still not enough, because reading the "event_statistics" table remains slow.

IMO you could pre-aggregate in a subquery. It would not produce more than 1600 aggregates.
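The work_mem remark above can be sketched as follows; the column order of the index and the 256MB figure are illustrative assumptions, not values from the discussion:

```sql
-- Assumption: a covering index matching the filter and sort columns.
-- Index name, column order, and the work_mem value are illustrative.
CREATE INDEX event_statistics_client_date_action_idx
    ON event_statistics (client_id, date_update, action_name);

-- Raise work_mem for this session only, so the sort can switch from
-- "external merge  Disk: ..." to an in-memory quicksort.
SET work_mem = '256MB';
```

Note that SET only affects the current session; a per-query alternative is SET LOCAL inside a transaction.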
The 1000 is the default row estimate for generate_series(): "Function Scan on generate_series t (cost=0.02..10.02 rows=1000 width=8)". In this case it is far too large… Maybe materializing the calendar (+ ANALYZE) would hint the optimizer. Please update your question and add the generated execution plan (note the EXPLAIN (ANALYZE, VERBOSE) options!).

I tested your solution and checked the query plan. Thanks for the effort, but unfortunately the proposed query runs slightly slower than mine (+1-2 seconds), because it performs more checks and joins. I cannot benefit from a filter on "action_name", because I need to aggregate over all the action_name values that may occur in "event_statistics"; in the query I list them manually as VALUES, because fetching the distinct "action_name" values from the table is very slow.

Please try replacing action_name with a numeric value (and ANALYZE), with action_names {id, action_name} as a dimension table. Then add some usable (composite) indexes.

Still very slow. An indexed temp table would help when there are many combinations of timestamp and action_name, but under the current conditions it makes no difference; otherwise I would do it that way. I had also tried an index on (action_name, date_update) before, but it did not help, even though I ran VACUUM ANALYZE. The optimizer almost always picks a seq scan instead of that index.

You appear to have a data-modelling problem in that fact table: too many key elements (with low cardinality) and too many (near-)candidate keys. Also: lots of meat on the bones (text, jsonb, etc.), but that will probably be TOASTed away anyway. I have one more candidate (basically reworking the timestamp -> date-part handling), which produces a good plan here... BRB... action_id integer not null, foreign key referencing action_names (id).
CREATE TABLE public.clients (
  id int4 NOT NULL DEFAULT nextval('clients_id_seq'::regclass),
  client_name text NULL,
  client_hash text NULL,
  CONSTRAINT clients_pkey PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
) ;
CREATE INDEX clients_client_name_idx ON public.clients (client_name text_ops) ;
SELECT t.date, A.actionName, count(E.id)
FROM generate_series(current_date - interval '1 week', now(), interval '1 day') as t(date)
CROSS JOIN (VALUES ('page_open')
                 , ('product_add')
                 , ('product_buy')
                 , ('product_event')
                 , ('product_favourite')
                 , ('product_open')
                 , ('product_share')
                 , ('session_start')) as A(actionName)
LEFT JOIN (
    SELECT action_name, date_trunc('day', e.date_update) as dateTime, e.id
    FROM event_statistics as e
    WHERE e.client_id = (SELECT id FROM clients as c WHERE c.client_name = 'client name')
      AND (date_update BETWEEN (current_date - interval '1 week') AND now())
) E ON t.date = E.dateTime AND A.actionName = E.action_name
GROUP BY A.actionName, t.date
ORDER BY A.actionName, t.date;
GroupAggregate  (cost=171937.16..188106.84 rows=1600 width=44)
  Group Key: "*VALUES*".column1, t.date
  InitPlan 1 (returns $0)
    ->  Seq Scan on clients c  (cost=0.00..1.07 rows=1 width=4)
          Filter: (client_name = 'client name'::text)
  ->  Merge Left Join  (cost=171936.08..183784.31 rows=574060 width=44)
        Merge Cond: (("*VALUES*".column1 = e.action_name) AND (t.date = (date_trunc('day'::text, e.date_update))))
        ->  Sort  (cost=628.77..648.77 rows=8000 width=40)
              Sort Key: "*VALUES*".column1, t.date
              ->  Nested Loop  (cost=0.02..110.14 rows=8000 width=40)
                    ->  Function Scan on generate_series t  (cost=0.02..10.02 rows=1000 width=8)
                    ->  Materialize  (cost=0.00..0.14 rows=8 width=32)
                          ->  Values Scan on "*VALUES*"  (cost=0.00..0.10 rows=8 width=32)
        ->  Materialize  (cost=171307.32..171881.38 rows=114812 width=24)
              ->  Sort  (cost=171307.32..171594.35 rows=114812 width=24)
                    Sort Key: e.action_name, (date_trunc('day'::text, e.date_update))
                    ->  Index Scan using regdate on event_statistics e  (cost=0.57..159302.49 rows=114812 width=24)
                          Index Cond: ((date_update > (('now'::cstring)::date - '7 days'::interval)) AND (date_update <= now()))
                          Filter: (client_id = $0)
GroupAggregate  (cost=860934.44..969228.46 rows=1600 width=44) (actual time=52388.678..54671.187 rows=64 loops=1)
  Output: t.date, "*VALUES*".column1, count(e.id)
  Group Key: "*VALUES*".column1, t.date
  InitPlan 1 (returns $0)
    ->  Seq Scan on public.clients c  (cost=0.00..1.07 rows=1 width=4) (actual time=0.058..0.059 rows=1 loops=1)
          Output: c.id
          Filter: (c.client_name = 'client name'::text)
          Rows Removed by Filter: 5
  ->  Merge Left Join  (cost=860933.36..940229.77 rows=3864215 width=44) (actual time=52388.649..54388.698 rows=799737 loops=1)
        Output: t.date, "*VALUES*".column1, e.id
        Merge Cond: (("*VALUES*".column1 = e.action_name) AND (t.date = (date_trunc('day'::text, e.date_update))))
        ->  Sort  (cost=628.77..648.77 rows=8000 width=40) (actual time=0.190..0.244 rows=64 loops=1)
              Output: t.date, "*VALUES*".column1
              Sort Key: "*VALUES*".column1, t.date
              Sort Method: quicksort  Memory: 30kB
              ->  Nested Loop  (cost=0.02..110.14 rows=8000 width=40) (actual time=0.059..0.080 rows=64 loops=1)
                    Output: t.date, "*VALUES*".column1
                    ->  Function Scan on pg_catalog.generate_series t  (cost=0.02..10.02 rows=1000 width=8) (actual time=0.043..0.043 rows=8 loops=1)
                          Output: t.date
                          Function Call: generate_series(((('now'::cstring)::date - '7 days'::interval))::timestamp with time zone, now(), '1 day'::interval)
                    ->  Materialize  (cost=0.00..0.14 rows=8 width=32) (actual time=0.002..0.003 rows=8 loops=8)
                          Output: "*VALUES*".column1
                          ->  Values Scan on "*VALUES*"  (cost=0.00..0.10 rows=8 width=32) (actual time=0.004..0.005 rows=8 loops=1)
                                Output: "*VALUES*".column1
        ->  Materialize  (cost=860304.60..864168.81 rows=772843 width=24) (actual time=52388.441..54053.748 rows=799720 loops=1)
              Output: e.id, e.date_update, e.action_name, (date_trunc('day'::text, e.date_update))
              ->  Sort  (cost=860304.60..862236.70 rows=772843 width=24) (actual time=52388.432..53703.531 rows=799720 loops=1)
                    Output: e.id, e.date_update, e.action_name, (date_trunc('day'::text, e.date_update))
                    Sort Key: e.action_name, (date_trunc('day'::text, e.date_update))
                    Sort Method: external merge  Disk: 39080kB
                    ->  Index Scan using regdate on public.event_statistics e  (cost=0.57..753018.26 rows=772843 width=24) (actual time=31.423..44284.363 rows=799720 loops=1)
                          Output: e.id, e.date_update, e.action_name, date_trunc('day'::text, e.date_update)
                          Index Cond: ((e.date_update >= (('now'::cstring)::date - '7 days'::interval)) AND (e.date_update <= now()))
                          Filter: (e.client_id = $0)
                          Rows Removed by Filter: 2983424
Planning time: 7.278 ms
Execution time: 54708.041 ms
EXPLAIN
SELECT cal.theday, act.action_name, SUM(sub.the_count)
FROM generate_series(current_date - interval '1 week', now(), interval '1 day') as cal(theday) -- calendar pseudo-table
CROSS JOIN (VALUES ('page_open')
                 , ('product_add')
                 , ('product_buy')
                 , ('product_event')
                 , ('product_favourite')
                 , ('product_open')
                 , ('product_share')
                 , ('session_start')
           ) AS act(action_name)
LEFT JOIN (
    SELECT es.action_name, date_trunc('day', es.date_update) as theday
         , COUNT(DISTINCT es.id) AS the_count
    FROM event_statistics as es
    WHERE es.client_id = (SELECT c.id FROM clients AS c WHERE c.client_name = 'client name')
      AND (es.date_update BETWEEN (current_date - interval '1 week') AND now())
    GROUP BY 1, 2
) sub ON cal.theday = sub.theday AND act.action_name = sub.action_name
GROUP BY act.action_name, cal.theday
ORDER BY act.action_name, cal.theday
;
EXPLAIN
WITH act(action_name) AS (
    VALUES ('page_open')
         , ('product_add')
         , ('product_buy')
         , ('product_event')
         , ('product_favourite')
         , ('product_open')
         , ('product_share')
         , ('session_start')
)
SELECT cal.theday, act.action_name, SUM(sub.the_count)
FROM generate_series(current_date - interval '1 week', now(), interval '1 day') AS cal(theday)
CROSS JOIN act
LEFT JOIN (
    SELECT es.action_name, date_trunc('day', es.date_update) AS theday
         , COUNT(DISTINCT es.id) AS the_count
    FROM event_statistics AS es
    WHERE es.date_update BETWEEN (current_date - interval '1 week') AND now()
      AND EXISTS (SELECT * FROM clients cli WHERE cli.id = es.client_id AND cli.client_name = 'client name')
      AND EXISTS (SELECT * FROM act WHERE act.action_name = es.action_name)
    GROUP BY 1, 2
) sub ON cal.theday = sub.theday AND act.action_name = sub.action_name
GROUP BY act.action_name, cal.theday
ORDER BY act.action_name, cal.theday
;
-- Final attempt: materialize the cartesian product (timeseries * action_name)
-- into a temp table
CREATE TEMP TABLE grid AS (
    SELECT act.action_name, cal.theday
    FROM generate_series(current_date - interval '1 week', now(), interval '1 day') AS cal(theday)
    CROSS JOIN (VALUES ('page_open')
                     , ('product_add')
                     , ('product_buy')
                     , ('product_event')
                     , ('product_favourite')
                     , ('product_open')
                     , ('product_share')
                     , ('session_start')
               ) act(action_name)
);
CREATE UNIQUE INDEX ON grid(action_name, theday);

-- The index will force statistics to be collected,
-- and will generate better estimates for the numbers of rows
CREATE INDEX iii ON event_statistics (action_name, date_update) ;

VACUUM ANALYZE grid;
VACUUM ANALYZE event_statistics;

EXPLAIN
SELECT grid.action_name, grid.theday, SUM(sub.the_count) AS the_count
FROM grid
LEFT JOIN (
    SELECT es.action_name, date_trunc('day', es.date_update) AS theday
         , COUNT(*) AS the_count
    FROM event_statistics AS es
    WHERE es.date_update BETWEEN (current_date - interval '1 week') AND now()
      AND EXISTS (SELECT * FROM clients cli WHERE cli.id = es.client_id AND cli.client_name = 'client name')
      -- AND EXISTS (SELECT * FROM grid WHERE grid.action_name = es.action_name)
    GROUP BY 1, 2
    ORDER BY 1, 2 -- nonsense!
) sub ON grid.theday = sub.theday AND grid.action_name = sub.action_name
GROUP BY grid.action_name, grid.theday
ORDER BY grid.action_name, grid.theday
;
-- Attempt #4:
-- - materialize the cartesian product (timeseries * action_name)
-- - sanitize the date-interval logic
CREATE TEMP TABLE grid AS (
    SELECT act.action_name, cal.theday::date
    FROM generate_series(current_date - interval '1 week', now(), interval '1 day') AS cal(theday)
    CROSS JOIN (VALUES ('page_open')
                     , ('product_add')
                     , ('product_buy')
                     , ('product_event')
                     , ('product_favourite')
                     , ('product_open')
                     , ('product_share')
                     , ('session_start')
               ) act(action_name)
);

-- The index will force statistics to be collected,
-- and will generate better estimates for the numbers of rows
-- CREATE UNIQUE INDEX ON grid(action_name, theday);
-- CREATE INDEX iii ON event_statistics (action_name, date_update) ;
CREATE UNIQUE INDEX ON grid(theday, action_name);
CREATE INDEX iii ON event_statistics (date_update, action_name) ;

VACUUM ANALYZE grid;
VACUUM ANALYZE event_statistics;

EXPLAIN
SELECT gr.action_name, gr.theday
     , COUNT(*) AS the_count
FROM grid gr
LEFT JOIN event_statistics AS es
       ON es.action_name = gr.action_name
      AND date_trunc('day', es.date_update)::date = gr.theday
      AND es.date_update BETWEEN (current_date - interval '1 week') AND current_date
JOIN clients cli
  ON cli.id = es.client_id AND cli.client_name = 'client name'
GROUP BY gr.action_name, gr.theday
ORDER BY 1, 2
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
GroupAggregate  (cost=8.33..8.35 rows=1 width=17)
  Group Key: gr.action_name, gr.theday
  ->  Sort  (cost=8.33..8.34 rows=1 width=17)
        Sort Key: gr.action_name, gr.theday
        ->  Nested Loop  (cost=1.40..8.33 rows=1 width=17)
              ->  Nested Loop  (cost=1.31..7.78 rows=1 width=40)
                    Join Filter: (es.client_id = cli.id)
                    ->  Index Scan using clients_client_name_key on clients cli  (cost=0.09..2.30 rows=1 width=4)
                          Index Cond: (client_name = 'client name'::text)
                    ->  Bitmap Heap Scan on event_statistics es  (cost=1.22..5.45 rows=5 width=44)
                          Recheck Cond: ((date_update >= (('now'::cstring)::date - '7 days'::interval)) AND (date_update <= ('now'::cstring)::date))
                          ->  Bitmap Index Scan on iii  (cost=0.00..1.22 rows=5 width=0)
                                Index Cond: ((date_update >= (('now'::cstring)::date - '7 days'::interval)) AND (date_update <= ('now'::cstring)::date))
              ->  Index Only Scan using grid_theday_action_name_idx on grid gr  (cost=0.09..0.54 rows=1 width=17)
                    Index Cond: ((theday = (date_trunc('day'::text, es.date_update))::date) AND (action_name = es.action_name))
(15 rows)