Optimizing date queries on large child tables: GiST or GIN?

Problem

There are 72 child tables, each with a year index and a station index, defined as follows:

CREATE TABLE climate.measurement_12_013
(
-- Inherited from table climate.measurement_12_013:  id bigint NOT NULL DEFAULT nextval('climate.measurement_id_seq'::regclass),
-- Inherited from table climate.measurement_12_013:  station_id integer NOT NULL,
-- Inherited from table climate.measurement_12_013:  taken date NOT NULL,
-- Inherited from table climate.measurement_12_013:  amount numeric(8,2) NOT NULL,
-- Inherited from table climate.measurement_12_013:  category_id smallint NOT NULL,
-- Inherited from table climate.measurement_12_013:  flag character varying(1) NOT NULL DEFAULT ' '::character varying,
  CONSTRAINT measurement_12_013_category_id_check CHECK (category_id = 7),
  CONSTRAINT measurement_12_013_taken_check CHECK (date_part('month'::text, taken)::integer = 12)
)
INHERITS (climate.measurement);

CREATE INDEX measurement_12_013_s_idx
  ON climate.measurement_12_013
  USING btree
  (station_id);
CREATE INDEX measurement_12_013_y_idx
  ON climate.measurement_12_013
  USING btree
  (date_part('year'::text, taken));

(Foreign key constraints will be added later.)

The following query runs very slowly because it performs a full table scan:

SELECT
  count(1) AS measurements,
  avg(m.amount) AS amount
FROM
  climate.measurement m
WHERE
  m.station_id IN (
    SELECT
      s.id
    FROM
      climate.station s,
      climate.city c
    WHERE
        /* For one city... */
        c.id = 5182 AND

        /* Where stations are within an elevation range... */
        s.elevation BETWEEN 0 AND 3000 AND

        /* and within a specific radius... */
        6371.009 * SQRT( 
          POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) +
            (COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) *
              POW(RADIANS(c.longitude_decimal - s.longitude_decimal), 2))
        ) <= 50
    ) AND

  /* Data before 1900 is shaky; insufficient after 2009. */
  extract( YEAR FROM m.taken ) BETWEEN 1900 AND 2009 AND

  /* Whittled down by category... */
  m.category_id = 1 AND

  /* Between the selected days and years... */
  m.taken BETWEEN
   /* Start date. */
   (extract( YEAR FROM m.taken )||'-01-01')::date AND
    /* End date. Calculated by checking to see if the end date wraps
       into the next year. If it does, then add 1 to the current year.
    */
    (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
      sign(
        (extract( YEAR FROM m.taken )||'-12-31')::date -
        (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
    ) AS text)||'-12-31')::date
GROUP BY
  extract( YEAR FROM m.taken )

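The m.taken BETWEEN clause is the part of the query that matches the selected days. For example, if the user wants to see data between June 1 and July 1 for every year that has data, the clause matches only those days. If the user wants to see data between December 22 and March 22, again for every year that has data, the clause works out that March 22 falls in the year after December 22 and matches the dates accordingly.

Currently the dates are fixed at January 1 to December 31, but they will be parameterized, as shown above.

The plan has a total cost of 10006220141.11, which I suspect is astronomically high.

A full table scan is being performed on the measurement table (which itself has neither data nor indexes). The table aggregates 273 million rows from its child tables.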

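Question

What is the proper way to index the dates to avoid full table scans?

Options I have considered:

GIN
GiST
Rewrite the WHERE clause
Separate year, month, and day columns in the table

What are your thoughts?

Thank you!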

Try something like this:

create temporary table test (d date);

insert into test select '1970-01-01'::date+generate_series(1,50*365);

analyze test;

create function month_day(d date) returns int as $$
  select extract(month from $1)::int*100+extract(day from $1)::int $$
language sql immutable strict;

create index test_d_month_day_idx on test (month_day(d));

explain analyze select * from test
  where month_day(d)>=month_day('2000-04-01')
  and month_day(d)<=month_day('2000-04-05');
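
The question also asks about ranges that wrap past the end of the year, such as December 22 to March 22. A minimal sketch of how the same expression index might cover that case, assuming the test table and month_day function defined above: the range is split into two half-ranges joined with OR. Whether the planner actually picks the index depends on how selective the range is, so check the plan.

```sql
-- Range wraps over the year end: match Dec 22..Dec 31 OR Jan 1..Mar 22.
-- The year in the literals is arbitrary; month_day() only looks at
-- the month and the day.
explain analyze select * from test
  where month_day(d) >= month_day('2000-12-22')
     or month_day(d) <= month_day('2000-03-22');
```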

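Your problem is that you have a WHERE clause that depends on a calculation on the date. If the database has to fetch every row and run a calculation on it before it knows whether the date matches, it cannot use an index.
Unless you rewrite the clause so that the database has a fixed range to check, one that does not depend on the data being retrieved, you will always have to scan the table.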
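
To illustrate the difference, here is a rough sketch against the December child table from the question. The index name measurement_12_013_taken_idx is an assumption; the question only defines indexes on station_id and on the year of taken.

```sql
-- Hypothetical plain btree index on the date column (not in the question).
CREATE INDEX measurement_12_013_taken_idx
  ON climate.measurement_12_013 (taken);

-- Not indexable: both bounds are derived from m.taken itself, so each
-- row has to be read before the comparison can be evaluated, e.g.
--   m.taken BETWEEN (extract(YEAR FROM m.taken)||'-01-01')::date AND ...

-- Indexable: the bounds are constants known before any row is read.
SELECT count(1) AS measurements, avg(m.amount) AS amount
FROM climate.measurement_12_013 m
WHERE m.taken BETWEEN DATE '2000-12-22' AND DATE '2000-12-31';
```
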
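
I think that to run this efficiently across those partitions, I would make your application much smarter about the date ranges. Have it generate an actual list of dates to check for each partition, then have it generate a query with a UNION between the partitions. It sounds like your data set is fairly static, so a CLUSTER on your date index could also improve performance considerably.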
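
A rough sketch of that idea, with several assumptions: the index measurement_12_013_taken_idx and the January table climate.measurement_01_013 are hypothetical names, and the sketch uses literal date ranges with UNION ALL rather than an explicit list of dates.

```sql
-- Assumption: a plain btree index on taken already exists on the
-- partition (hypothetical name). Physically reorder the partition by
-- that index, then refresh the planner statistics.
CLUSTER climate.measurement_12_013 USING measurement_12_013_taken_idx;
ANALYZE climate.measurement_12_013;

-- One branch per partition, each with literal dates for that partition.
-- climate.measurement_01_013 is a hypothetical name for the January table.
SELECT m.taken, m.amount
FROM climate.measurement_12_013 m
WHERE m.taken BETWEEN DATE '2008-12-22' AND DATE '2008-12-31'
UNION ALL
SELECT m.taken, m.amount
FROM climate.measurement_01_013 m
WHERE m.taken BETWEEN DATE '2009-01-01' AND DATE '2009-01-31';
```

CLUSTER is a one-time reordering; rows inserted later are not kept in index order, which is why it suits a mostly static data set.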
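
What does the slow part do? I don't understand it.

@Tometzky: I updated the question with a picture that shows the query parameters.

You say your tables are partitioned by year and station, but your constraints don't match, so the planner is probably not pruning properly. Not to mention that if you run it wide open across all the partitions, your cost will go up substantially. Fewer or different partitions might help.

@rfusca: The partitioning (I hope) is not the problem; the problem is that a full table scan is being performed because the planner cannot compare the calculated dates (built from strings) against the actual dates. One idea I had was to split the date range into two. For December 22 to March 22, the query would look at December 22 through December 31 of the current year, and at January 1 through March 22 of the following year.

What I'm saying is that yes, that's true, but you may also be doing 72 table scans, aren't you?

@Cobusive: I was under the impression that GIN and GiST could avoid that. Besides, I'm looking for the best solution. A full table scan is not an option; there are nearly 300 million rows! :-) MySQL can execute the query in 5 seconds; PostgreSQL has yet to finish.

I've considered this; I could have the application generate part of the WHERE clause, but then I would be tightly coupling two completely different systems: the website and the report engine.

You could move the dynamic generation of the WHERE clause into a pl/pgsql function. It can build the query, and you just call "select * from my_procedure(parameters)", which is not much different from before.

That's an interesting proposal, too. Once the basic query is working (and fast), I'll look into it further. Thank you very much.

It was slow because the tables were not physically ordered. I fixed this by (1) reinserting the data in primary key order; and (2) adding a clustered index.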
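
The pl/pgsql suggestion from the comments could look roughly like the sketch below. It assumes the test table and the month_day function from the first answer; the function name rows_between and its parameters are hypothetical. It picks the predicate shape from the arguments, so the caller never assembles SQL itself.

```sql
-- Hypothetical helper: takes the start and end of the range encoded as
-- month*100+day (the same encoding month_day() produces).
create function rows_between(start_md int, end_md int)
returns setof date as $$
begin
  if start_md <= end_md then
    -- Range stays within one calendar year.
    return query execute
      'select d from test where month_day(d) between $1 and $2'
      using start_md, end_md;
  else
    -- Range wraps over the year end: two half-ranges joined with OR.
    return query execute
      'select d from test where month_day(d) >= $1 or month_day(d) <= $2'
      using start_md, end_md;
  end if;
end;
$$ language plpgsql stable;

-- December 22 through March 22, for every year in the table.
select * from rows_between(1222, 322);
```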