PostgreSQL: Making a Postgres query faster. More indexes?


I am running GeoDjango / Postgres 9.1 / PostGIS, and I am trying to get the query below (and other queries like it) to run faster.

[Query truncated for brevity]

SELECT "crowdbreaks_incomingkeyword"."keyword_id"
       , COUNT("crowdbreaks_incomingkeyword"."keyword_id") AS "cnt" 
  FROM "crowdbreaks_incomingkeyword"
 INNER JOIN "crowdbreaks_tweet"
       ON ("crowdbreaks_incomingkeyword"."tweet_id"
          = "crowdbreaks_tweet"."tweet_id")
  LEFT OUTER JOIN "crowdbreaks_place"
    ON ("crowdbreaks_tweet"."place_id"
       = "crowdbreaks_place"."place_id") 
 WHERE (("crowdbreaks_tweet"."coordinates"
        @ ST_GeomFromEWKB(E'\\001 ... \\000\\000\\000\\0008@'::bytea)
       OR ST_Overlaps("crowdbreaks_place"."bounding_box"
                     , ST_GeomFromEWKB(E'\\001...00\\000\\0008@'::bytea)
       )) 
   AND "crowdbreaks_tweet"."created_at" > E'2012-04-17 15:46:12.109893'
   AND "crowdbreaks_tweet"."created_at" < E'2012-04-18 15:46:12.109899' ) 
 GROUP BY "crowdbreaks_incomingkeyword"."keyword_id"
         , "crowdbreaks_incomingkeyword"."keyword_id"
    ;
Here is the EXPLAIN ANALYZE output for the query:

 HashAggregate  (cost=184022.03..184023.18 rows=115 width=4) (actual time=6381.707..6381.769 rows=62 loops=1)
   ->  Hash Join  (cost=103857.48..183600.24 rows=84357 width=4) (actual time=1745.449..6377.505 rows=3453 loops=1)
         Hash Cond: (crowdbreaks_incomingkeyword.tweet_id = crowdbreaks_tweet.tweet_id)
         ->  Seq Scan on crowdbreaks_incomingkeyword  (cost=0.00..36873.97 rows=2252597 width=12) (actual time=0.008..2136.839 rows=2252597 loops=1)
         ->  Hash  (cost=102535.68..102535.68 rows=80544 width=8) (actual time=1744.815..1744.815 rows=3091 loops=1)
               Buckets: 4096  Batches: 4  Memory Usage: 32kB
               ->  Hash Left Join  (cost=16574.93..102535.68 rows=80544 width=8) (actual time=112.551..1740.651 rows=3091 loops=1)
                     Hash Cond: ((crowdbreaks_tweet.place_id)::text = (crowdbreaks_place.place_id)::text)
                     Filter: ((crowdbreaks_tweet.coordinates @ '0103000020E61000000100000005000000AE47E17A141E5FC00000000000003840AE47E17A141E5FC029ED0DBE30B14840A4703D0AD7A350C029ED0DBE30B14840A4703D0AD7A350C00000000000003840AE47E17A141E5FC00000000000003840'::geometry) OR ((crowdbreaks_place.bounding_box && '0103000020E61000000100000005000000AE47E17A141E5FC00000000000003840AE47E17A141E5FC029ED0DBE30B14840A4703D0AD7A350C029ED0DBE30B14840A4703D0AD7A350C00000000000003840AE47E17A141E5FC00000000000003840'::geometry) AND _st_overlaps(crowdbreaks_place.bounding_box, '0103000020E61000000100000005000000AE47E17A141E5FC00000000000003840AE47E17A141E5FC029ED0DBE30B14840A4703D0AD7A350C029ED0DBE30B14840A4703D0AD7A350C00000000000003840AE47E17A141E5FC00000000000003840'::geometry)))
                     ->  Bitmap Heap Scan on crowdbreaks_tweet  (cost=15874.18..67060.28 rows=747873 width=125) (actual time=96.012..940.462 rows=736784 loops=1)
                           Recheck Cond: ((created_at > '2012-04-17 15:46:12.109893+00'::timestamp with time zone) AND (created_at < '2012-04-18 15:46:12.109899+00'::timestamp with time zone))
                           ->  Bitmap Index Scan on crowdbreaks_tweet_crreated_at  (cost=0.00..15687.22 rows=747873 width=0) (actual time=94.259..94.259 rows=736784 loops=1)
                                 Index Cond: ((created_at > '2012-04-17 15:46:12.109893+00'::timestamp with time zone) AND (created_at < '2012-04-18 15:46:12.109899+00'::timestamp with time zone))
                     ->  Hash  (cost=217.11..217.11 rows=6611 width=469) (actual time=15.926..15.926 rows=6611 loops=1)
                           Buckets: 1024  Batches: 4  Memory Usage: 259kB
                           ->  Seq Scan on crowdbreaks_place  (cost=0.00..217.11 rows=6611 width=469) (actual time=0.005..6.908 rows=6611 loops=1)
 Total runtime: 6381.903 ms
(17 rows)
That is a pretty bad runtime for the query. Ideally, I would like the results back within a second or two.


I have increased shared_buffers on Postgres to 2GB (I have 8GB of RAM), but beyond that I am not quite sure what to do. What are my options? Should I use fewer joins? Are there any other indexes I can add? The sequential scan on crowdbreaks_incomingkeyword does not make sense to me. That table just holds foreign keys into other tables, so it has indexes on it.
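For instance, the indexes that actually exist on that table can be listed via the pg_indexes catalog view (a generic sanity check, not from the original post; the table name is taken from the query above):

SELECT indexname, indexdef
  FROM pg_indexes
 WHERE tablename = 'crowdbreaks_incomingkeyword';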

Judging from your comments, I would try two things:

  • Raise the statistics target for the relevant columns (and run ANALYZE afterwards). The data distribution may be uneven; a larger sample can give the query planner more accurate estimates. (See the ALTER TABLE example at the end of this page.)

  • Adjust the cost settings in postgresql.conf. Your sequential scans may need to become more expensive relative to index scans for the planner to produce good estimates.

Try lowering cpu_index_tuple_cost and setting effective_cache_size to as much as three quarters of the total RAM of a dedicated database server.
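As a minimal sketch of that second hint for an 8GB dedicated server (the concrete values below are illustrative assumptions, not figures from the answer), the settings can be tried per session before persisting them in postgresql.conf:

SET cpu_index_tuple_cost = 0.001;   -- lowered from the 0.005 default to make index scans look cheaper
SET effective_cache_size = '6GB';   -- roughly three quarters of 8GB total RAM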

If I set enable_seqscan to off; before running the EXPLAIN ANALYZE, the query time drops to 1.8 seconds. However, everything I have read says that I should not do that.

You could do that. I pasted your EXPLAIN output here for a better view. BTW: what is the effect of the duplicated term "crowdbreaks_incomingkeyword"."keyword_id" in the GROUP BY clause? (Perhaps the optimizer forgets to remove such redundancy and computes the selectivity of the groups as 1/square(number of groups).)

Do I raise the statistics for the columns I use for filtering, the columns I select, or both?

@Khandelwal: The columns you filter and join on are relevant to the plan; merely selecting a column is not.
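For reference, the enable_seqscan experiment mentioned above can be reproduced per session (this sketch is not from the thread; the ... stands in for the full query from the question):

SET enable_seqscan = off;   -- discourage sequential scans for this session only; not for production
EXPLAIN ANALYZE SELECT ...; -- re-run the query from the question
RESET enable_seqscan;       -- restore the default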
@Khandelwal: If most of the relevant part of your database fits into RAM, you should set random_page_cost lower, only slightly above seq_page_cost. I have 1 for seq_page_cost and 1.1 for random_page_cost on a mostly cached database cluster. In RAM, random access is basically as fast as sequential access; only on disk is it much slower. Another setting I regularly find helpful is raising cpu_tuple_cost to somewhere between 0.03 and 0.05, which may be partly redundant with lowering cpu_index_tuple_cost, since cost factors are relative to one another. BTW, the reason for raising effective_cache_size is that it tells the planner how much data is likely to stay in cache when it is accessed repeatedly within a single query; higher values tend to make index access look cheaper. Usually I set effective_cache_size as high as possible (available memory minus the memory used by all processes, in your case roughly 8 - 2 = 6 GB), and I tend to keep shared_buffers comparatively low. I have heard that shared_buffers hurts on VMs (probably because…
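As a session-level sketch of the values from this comment (cpu_tuple_cost = 0.04 is an assumed midpoint of the suggested 0.03 to 0.05 range):

SET seq_page_cost = 1;        -- baseline page cost
SET random_page_cost = 1.1;   -- only slightly above seq_page_cost for a mostly cached cluster
SET cpu_tuple_cost = 0.04;    -- raised from the 0.01 default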
To raise the statistics target for a column (the first hint in the answer above), with tbl and col as placeholder names:

ALTER TABLE tbl ALTER COLUMN col SET STATISTICS 1000;
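A changed statistics target only takes effect once new statistics have been gathered, so follow it with an ANALYZE (tbl is the same placeholder):

ANALYZE tbl;   -- re-sample the table so the planner uses the larger target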