PostgreSQL performance with a large array of values


PostgreSQL server 9.1

Two tables hold client data: banners (40K rows) and events (140M rows). The client_id column is the integer ID of the client and is indexed.

First query:

SELECT DISTINCT client_id
FROM events
WHERE type = 'banner_show' AND client_id IN (select distinct client_id from banners)

EXPLAIN ANALYZE for the first query:

EXPLAIN ANALYZE
SELECT DISTINCT client_id
FROM events
WHERE type = 'banner_show' AND client_id IN (select distinct client_id from banners)

"HashAggregate  (cost=4481767.32..4481767.74 rows=42 width=4) (actual time=24726.275..24727.259 rows=8572 loops=1)"
"  ->  Hash Join  (cost=1954.16..4481542.58 rows=89895 width=4) (actual time=16052.849..24698.907 rows=68770 loops=1)"
"        Hash Cond: (events.client_id = banners.client_id)"
"        ->  Seq Scan on events  (cost=0.00..4476744.47 rows=179790 width=4) (actual time=16037.562..24634.461 rows=69272 loops=1)"
"              Filter: ((type)::text = 'banner_show'::text)"
"        ->  Hash  (cost=1767.58..1767.58 rows=14926 width=4) (actual time=15.258..15.258 rows=13923 loops=1)"
"              Buckets: 2048  Batches: 1  Memory Usage: 490kB"
"              ->  HashAggregate  (cost=1469.06..1618.32 rows=14926 width=4) (actual time=12.421..13.805 rows=13923 loops=1)"
"                    ->  Seq Scan on banners  (cost=0.00..1369.45 rows=39845 width=4) (actual time=0.005..6.883 rows=38184 loops=1)"
"Total runtime: 24727.909 ms"
It runs for about 23 seconds.

Second query:

SELECT DISTINCT client_id
FROM events
WHERE type = 'banner_show' AND client_id IN (1, 2, 3, 4...)

EXPLAIN ANALYZE for the second query:

"HashAggregate  (cost=842924414.03..842924414.17 rows=14 width=4) (actual time=1521873.754..1521874.796 rows=8574 loops=1)"
"  ->  Bitmap Heap Scan on events  (cost=534167.70..842924261.77 rows=60905 width=4) (actual time=260305.233..1521811.644 rows=68782 loops=1)"
"        Recheck Cond: (client_id = ANY ('{153566,171259,151232,155132,160170,162720,152159,166302,175899,158611,}'::integer[]))"
"        Filter: ((type)::text = 'banner_show'::text)"
"        ->  Bitmap Index Scan on ix_events_client_id  (cost=0.00..534152.47 rows=48209684 width=0) (actual time=4916.828..4916.828 rows=5345417 loops=1)"
"              Index Cond: (client_id = ANY ('{153566,171259,151232,155132,......}'::integer[]))"
"Total runtime: 1521875.137 ms"
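Here 1, 2, 3, 4... stands for the result of select distinct client_id from banners. The second query ran for about 10 minutes before I stopped it. Why is there such a significant performance difference between two queries that use the same data?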

The two tables:

CREATE TABLE banners
(
  id serial NOT NULL,
  type_id integer,
  form_id integer,
  banner character varying,
  client_id integer,
  created timestamp without time zone,
  deleted timestamp without time zone,
  CONSTRAINT banners_pkey PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE banners
  OWNER TO postgres;

CREATE INDEX ix_banners_client_id
  ON banners
  USING btree
  (client_id);


CREATE TABLE events
(
  id serial NOT NULL,
  time_created timestamp without time zone,
  type character varying,
  date timestamp without time zone,
  param character varying,
  client_id integer,
  hash_id character varying,
  CONSTRAINT events_pkey PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE events
  OWNER TO postgres;

CREATE INDEX ix_events_client_id
  ON events
  USING btree
  (client_id);

CREATE INDEX ix_events_hash_id
  ON events
  USING btree
  (hash_id COLLATE pg_catalog."default");

When the filter condition uses two columns, you should create a single index that covers both columns:

CREATE INDEX event_client_show_idx 
ON events
USING btree (client_id, type);

The first select with EXPLAIN:

EXPLAIN
SELECT DISTINCT client_id
FROM events 
WHERE client_id IN (1, 2, 3, 4) AND type = 'banner_show';

returns something like this:

Unique  (cost=0.15..8.65 rows=1 width=4)
  ->  Index Only Scan using event_client_show_idx on events  (cost=0.15..8.65 rows=1 width=4)
        Index Cond: ((client_id = ANY ('{1,2,3,4}'::integer[])) AND (type = 'banner_show'::text))
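
To check whether the planner actually picks up the new index on the real table, something like the following could be run. This is only a sketch: `event_client_show_idx` is the index created above, the ID list is a placeholder for the real client IDs, and `BUFFERS` is optional extra detail. The second plan in the question estimates 48209684 rows from the bitmap index scan against 5345417 actual rows, so refreshing statistics first seems worthwhile. Also note that index-only scans were introduced in PostgreSQL 9.2, so on a 9.1 server the plan would show a regular index scan or bitmap scan rather than `Index Only Scan`.

```sql
-- Refresh planner statistics so row estimates reflect the current data.
ANALYZE events;

-- Execute the query and report actual timings plus buffer usage.
-- The ID list is a placeholder; substitute the actual client IDs.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT client_id
FROM events
WHERE client_id IN (1, 2, 3, 4)
  AND type = 'banner_show';
```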

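Read more about indexes on Markus Winand's blog.

Off-topic: a JOIN or EXISTS may be the fastest solution; IN runs into problems when you have many values. On-topic: please add the EXPLAIN ANALYZE output so we can see the difference between the query plans.

EXPLAIN ANALYZE output and table schemas added.

There is no pk in the tables. 99.8% of the query time is spent on sequential scans. The one on events in particular is painfully slow. Is there an index on client_id in both tables?

Both tables have a btree index on client_id.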
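
The EXISTS suggestion from the comments could look roughly like the sketch below. It assumes the two-column index on events (client_id, type) from the answer exists, and it drives the query from the small banners table so that each distinct client needs only a short index probe into events. Whether this beats the hash join on a given data set is something to confirm with EXPLAIN ANALYZE rather than take for granted.

```sql
-- Start from banners (about 40K rows) and probe events once per client.
-- Returns the same client IDs as the IN (subquery) form: clients that
-- appear in banners and have at least one 'banner_show' event.
SELECT DISTINCT b.client_id
FROM banners AS b
WHERE EXISTS (
    SELECT 1
    FROM events AS e
    WHERE e.client_id = b.client_id
      AND e.type = 'banner_show'
);
```

This form also avoids shipping thousands of literal IDs from the application into an IN list, since the list stays inside the database.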