PostgreSQL: replacing nulls within a subset of the data

Edit: my mistake, I originally called the timestamp column "date".

Our data table consists of timestamp, value, and delta columns, where delta is the number of minutes since the last non-null reading.

CREATE TABLE Table1
    ("ts" timestamp with time zone, "value" numeric, "delta" int)
;

INSERT INTO Table1
    ("ts", "value", "delta")
VALUES
    ('2019-09-09 12:01:00', 3.5, NULL),
    ('2019-09-09 12:02:00', 3.2, 1),
    ('2019-09-09 12:03:00', NULL, 1),
    ('2019-09-09 12:04:00', 2.9, 2),
    ('2019-09-09 12:05:00', NULL, 1),
    ('2019-09-09 12:06:00', 3.0, 2),
    ('2019-09-09 12:07:00', NULL, 1),
    ('2019-09-09 12:08:00', NULL, 2),
    ('2019-09-09 12:09:00', NULL, 3),
    ('2019-09-09 12:10:00', NULL, 4),
    ('2019-09-09 12:11:00', 3.2, 5),
    ('2019-09-09 12:12:00', NULL, 1)
;
SELECT ts,
       value,
       delta
  FROM Table1;

+---------------------+-------+-------+
| ts                  | value | delta |
+---------------------+-------+-------+
| 2019-09-09 12:01:00 | 3.5   |       |
| 2019-09-09 12:02:00 | 3.2   | 1     |
| 2019-09-09 12:03:00 |       | 1     |
| 2019-09-09 12:04:00 | 2.9   | 2     |
| 2019-09-09 12:05:00 |       | 1     |
| 2019-09-09 12:06:00 | 3.0   | 2     |
| 2019-09-09 12:07:00 |       | 1     |
| 2019-09-09 12:08:00 |       | 2     |
| 2019-09-09 12:09:00 |       | 3     |
| 2019-09-09 12:10:00 |       | 4     |
| 2019-09-09 12:11:00 | 3.2   | 5     |
| 2019-09-09 12:12:00 |       | 1     |
+---------------------+-------+-------+
Given a subset of the data, how do we replace the nulls with the last non-null value, provided that value has not already been selected:

We would like:

+---------------------+-------+-------+
| ts                  | value | delta |
+---------------------+-------+-------+
| 2019-09-09 12:01:00 | 3.5   |       |
| 2019-09-09 12:03:00 | 3.2   | 1     |
| 2019-09-09 12:05:00 | 2.9   | 1     |
| 2019-09-09 12:07:00 | 3.0   | 1     |
| 2019-09-09 12:09:00 |       | 3     |<- an actual null
| 2019-09-09 12:11:00 | 3.2   | 5     |
+---------------------+-------+-------+
When we allocate rows for a subset of the data, we now have both a value and the number of rows back to that value. What we don't know is how to decide, when we produce the subset via a WHERE clause, whether to carry a value forward or leave it as null.

Bonus points if the solution does not require a predefined delta column.

The idea of using sum(case when value is null then 0 else 1 end) over (order by ts) as value_p is a good one. It sorts the rows into groups that share the same value.
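
To see what that grouping step produces on its own, run just the window expression against the Table1 data above; each non-null reading starts a new group and the nulls after it inherit that group number:

SELECT ts,
       value,
       sum(case when value is null then 0 else 1 end) over (order by ts) as value_p
FROM table1
ORDER BY ts;
-- value_p: 12:01 -> 1, 12:02-12:03 -> 2, 12:04-12:05 -> 3,
--          12:06-12:10 -> 4, 12:11-12:12 -> 5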

From there, treating the dates as actual timestamps, you can group each set together with tsrange(min(ts), max(ts), '[]'), making sure the end of the range is inclusive so that groups that start and end on the same row are still caught.
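
A quick check of why the inclusive upper bound matters: the 12:01 group contains a single row, and with the default '[)' bounds its range would be empty and match nothing:

SELECT '2019-09-09 12:01:00'::timestamp <@ tsrange('2019-09-09 12:01:00', '2019-09-09 12:01:00', '[]') AS inclusive,  -- true
       '2019-09-09 12:01:00'::timestamp <@ tsrange('2019-09-09 12:01:00', '2019-09-09 12:01:00')       AS half_open;  -- false: '[)' with equal bounds is an empty range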

Then just join to the test dates with the contained-by operator (<@):

WITH test_dates(test_date) as (VALUES 
        ('2019-09-09 12:01:00'::timestamp),
        ('2019-09-09 12:03:00'),
        ('2019-09-09 12:05:00'),
        ('2019-09-09 12:07:00'),
        ('2019-09-09 12:09:00'),
        ('2019-09-09 12:11:00')
), value_ranges AS (
    SELECT tsrange(min(ts)::timestamp, max(ts)::timestamp, '[]') as sample_range, 
       max(value) as value, -- There's only one non-null value, this could be min
       value_p
    FROM (
       SELECT ts,
       value,
       sum(case when value is null then 0 else 1 end) over
            (order by ts) as value_p
       FROM table1
    ) sub 
    GROUP BY value_p
)
SELECT test_date, 
       CASE WHEN row_number() OVER (PARTITION BY value_p ORDER BY test_date) = 1 THEN value 
       ELSE null END AS value  -- Only the first row of the group is non-null
FROM test_dates
JOIN value_ranges on test_date <@ sample_range
;
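
Against the sample data above, this should return the requested rows, minus the delta column (add an ORDER BY test_date if the order matters):

+---------------------+-------+
| test_date           | value |
+---------------------+-------+
| 2019-09-09 12:01:00 | 3.5   |
| 2019-09-09 12:03:00 | 3.2   |
| 2019-09-09 12:05:00 | 2.9   |
| 2019-09-09 12:07:00 | 3.0   |
| 2019-09-09 12:09:00 |       |
| 2019-09-09 12:11:00 | 3.2   |
+---------------------+-------+
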
There is no need for the delta column.

Update: I realized I was pulling the entries after each sample reference point when I should have been pulling the ones before it. Fixed.

Based on your table, and assuming you want that column to be a timestamp rather than a date, this gives you what you're after. Just change the minutes_between_intervals column in the first table expression to spread the samples out.

I've made the CTEs more verbose than they need to be for the sake of readability.

WITH with_offsets AS (

  -- First add in some metadata about how many minutes have elapsed since you
  -- started sampling along with a constant for the sampling interval.

  SELECT
    2 AS minutes_between_intervals, -- This is how often you're sampling
    ts,
    value,
    delta,
    extract(minute FROM ts - (min(ts) OVER (ORDER BY ts)))::integer AS minutes_offset
  FROM Table1

), with_groups AS (

  -- Add grouping, setting the sample entries as reference points and the
  -- entries leading up to it as part of its group.

  SELECT
    *,
    CASE WHEN minutes_offset % minutes_between_intervals = 0 THEN minutes_offset
         ELSE minutes_offset + (minutes_between_intervals - (minutes_offset % minutes_between_intervals))
    END AS sample_group,
    minutes_offset % minutes_between_intervals = 0 AS is_sample_boundary
  FROM with_offsets

), with_arrays AS (

  -- Then aggregate them into arrays. The values array has all NULLs
  -- removed. The groups with sample entries are marked.

  SELECT
    array_agg(ts ORDER BY ts) AS dates,
    array_agg(value ORDER BY ts) FILTER (WHERE value IS NOT NULL) AS vals,
    array_agg(delta ORDER BY ts) AS deltas,
    bool_or(is_sample_boundary) AS has_complete_sample
  FROM with_groups
  GROUP BY sample_group
)

-- Now take the last entry from each array, which will be the sample date,
-- the last recorded value, and the last recorded sample delta.

SELECT
  dates[array_upper(dates, 1)] AS ts,
  vals[array_upper(vals, 1)] AS value,
  deltas[array_upper(deltas, 1)] AS delta
FROM with_arrays
WHERE has_complete_sample;
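
To make the sample_group rounding in with_groups concrete, here is the same CASE expression with the 2-minute interval substituted in, applied to the twelve minute offsets in the sample data; offsets that are not on a sample boundary roll forward into the next boundary's group:

SELECT o AS minutes_offset,
       CASE WHEN o % 2 = 0 THEN o
            ELSE o + (2 - (o % 2))
       END AS sample_group,
       o % 2 = 0 AS is_sample_boundary
FROM generate_series(0, 11) AS g(o);
-- offsets 1 and 2 land in sample_group 2 (the 12:03 sample),
-- offsets 3 and 4 in sample_group 4 (the 12:05 sample), and so on.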

Mind-bending stuff. It works great for small data sets, but gets expensive as the data grows: pulling 500 rows out of 25,000 takes about 1.5 seconds, and that is after replacing the tsrange with faster min/max columns and a JOIN ... ON test_date >= min AND test_date <= max.

Yes, joining on those will be faster, unless you store the dates as actual timestamps, which I recommend. All of that aggregation is certainly going to be expensive, but to recommend any further optimization we would have to see explain analyze output in a new question.

The dates actually are timestamps. My original edit and fiddle mistakenly set the "date" column up as a varchar.

We don't know the sampling cadence ahead of time. It varies with the size of the original data pool and may not be an even increment.
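
As a sketch of the plain-comparison variant described above, here is the first query with the tsrange/<@ join swapped for min/max columns and ordinary comparisons; the range_start/range_end names and the exact pair of predicates are my own reading of the comment, not the commenter's code:

WITH test_dates(test_date) AS (VALUES
        ('2019-09-09 12:01:00'::timestamp),
        ('2019-09-09 12:03:00'),
        ('2019-09-09 12:05:00'),
        ('2019-09-09 12:07:00'),
        ('2019-09-09 12:09:00'),
        ('2019-09-09 12:11:00')
), value_ranges AS (
    SELECT min(ts)::timestamp AS range_start,  -- plain columns instead of a tsrange
           max(ts)::timestamp AS range_end,
           max(value) AS value,
           value_p
    FROM (
       SELECT ts,
              value,
              sum(case when value is null then 0 else 1 end) over (order by ts) as value_p
       FROM table1
    ) sub
    GROUP BY value_p
)
SELECT test_date,
       CASE WHEN row_number() OVER (PARTITION BY value_p ORDER BY test_date) = 1
            THEN value
       END AS value
FROM test_dates
JOIN value_ranges
  ON test_date >= range_start
 AND test_date <= range_end;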