PostgreSQL: non-null substitution in a subset of data
Edit: fixed my mistake of calling the timestamp column "date". Our data table consists of timestamp, value, and delta columns; delta is the number of minutes since the last non-null reading.
CREATE TABLE Table1
("ts" timestamp with time zone, "value" numeric, "delta" int)
;
INSERT INTO Table1
("ts", "value", "delta")
VALUES
('2019-09-09 12:01:00', 3.5, NULL),
('2019-09-09 12:02:00', 3.2, 1),
('2019-09-09 12:03:00', NULL, 1),
('2019-09-09 12:04:00', 2.9, 2),
('2019-09-09 12:05:00', NULL, 1),
('2019-09-09 12:06:00', 3.0, 2),
('2019-09-09 12:07:00', NULL, 1),
('2019-09-09 12:08:00', NULL, 2),
('2019-09-09 12:09:00', NULL, 3),
('2019-09-09 12:10:00', NULL, 4),
('2019-09-09 12:11:00', 3.2, 5),
('2019-09-09 12:12:00', NULL, 1)
;
SELECT ts,
       value,
       delta
FROM Table1;
+---------------------+-------+-------+
| ts | value | delta |
+---------------------+-------+-------+
| 2019-09-09 12:01:00 | 3.5   |       |
| 2019-09-09 12:02:00 | 3.2 | 1 |
| 2019-09-09 12:03:00 | | 1 |
| 2019-09-09 12:04:00 | 2.9 | 2 |
| 2019-09-09 12:05:00 | | 1 |
| 2019-09-09 12:06:00 | 3.0 | 2 |
| 2019-09-09 12:07:00 | | 1 |
| 2019-09-09 12:08:00 | | 2 |
| 2019-09-09 12:09:00 | | 3 |
| 2019-09-09 12:10:00 | | 4 |
| 2019-09-09 12:11:00 | 3.2 | 5 |
| 2019-09-09 12:12:00 | | 1 |
+---------------------+-------+-------+
Given a subset of the data, how can we replace each null with the last non-null value, unless that replacement value has already been selected? We want:
+---------------------+-------+-------+
| ts | value | delta |
+---------------------+-------+-------+
| 2019-09-09 12:01:00 | 3.5 | |
| 2019-09-09 12:03:00 | 3.2 | 1 |
| 2019-09-09 12:05:00 | 2.9 | 1 |
| 2019-09-09 12:07:00 | 3.0 | 1 |
| 2019-09-09 12:09:00 |       | 3     | <- an actual null
| 2019-09-09 12:11:00 | 3.2 | 5 |
+---------------------+-------+-------+
When we pull the rows for a subset of the data, we have both the value and the number of rows back to that value. What we don't know is how to decide, when generating the subset via a WHERE clause, whether to carry a value forward or leave it null.
Bonus points if the solution doesn't need the predefined delta column.

The idea is to use sum(case when value is null then 0 else 1 end) over (order by ts) as value_p. This sorts the rows into groups that share the same carried-forward value.
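As an illustration (not part of the answer), the same running-count trick can be sketched in Python. The data and names here are my own stand-ins: a non-null value bumps the counter, so every row gets tagged with the group of the most recent non-null reading.

```python
# Sketch of sum(case when value is null then 0 else 1 end) over (order by ts):
# a running count of non-null values assigns each row to the group of the
# most recent non-null reading.
rows = [
    ("12:01", 3.5), ("12:02", 3.2), ("12:03", None), ("12:04", 2.9),
    ("12:05", None), ("12:06", 3.0), ("12:07", None), ("12:08", None),
]

value_p = 0
groups = []  # (ts, value, value_p)
for ts, value in rows:
    if value is not None:
        value_p += 1  # a non-null reading starts a new group
    groups.append((ts, value, value_p))
```

Each group therefore starts with its single non-null row, followed by the nulls that should inherit it.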
From there, treating the timestamps as actual timestamps, you can group them together with tsrange(min(ts), max(ts), '[]'). Make sure the end of the range is inclusive, to catch rows where the start and end of a group coincide.
Then just join to the test dates with the contained-by operator (<@):
WITH test_dates(test_date) as (VALUES
  ('2019-09-09 12:01:00'::timestamp),
  ('2019-09-09 12:03:00'),
  ('2019-09-09 12:05:00'),
  ('2019-09-09 12:07:00'),
  ('2019-09-09 12:09:00'),
  ('2019-09-09 12:11:00')
), value_ranges AS (
  SELECT tsrange(min(ts)::timestamp, max(ts)::timestamp, '[]') as sample_range,
         max(value) as value, -- There's only one non-null value, this could be min
         value_p
  FROM (
    SELECT ts,
           value,
           sum(case when value is null then 0 else 1 end) over
             (order by ts) as value_p
    FROM table1
  ) sub
  GROUP BY value_p
)
SELECT test_date,
       CASE WHEN row_number() OVER (PARTITION BY value_p ORDER BY test_date) = 1 THEN value
            ELSE null END AS value -- Only the first row of the group is non-null
FROM test_dates
JOIN value_ranges on test_date <@ sample_range
;
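For intuition, here is a pure-Python sketch of the whole approach (my own stand-in data, with minute integers in place of timestamps): build an inclusive [min, max] range per value_p group, then for each test date find the containing range and keep the value only the first time that group is hit.

```python
# Pure-Python sketch of the range-join approach: group rows by a running
# count of non-null values, build inclusive [min, max] ranges, then join
# test dates by containment, nulling all but the first hit per group.
from collections import defaultdict

rows = [  # (minute, value) stand-ins for (ts, value)
    (1, 3.5), (2, 3.2), (3, None), (4, 2.9), (5, None), (6, 3.0),
    (7, None), (8, None), (9, None), (10, None), (11, 3.2), (12, None),
]
test_dates = [1, 3, 5, 7, 9, 11]

# Running count of non-null values = value_p
grouped = defaultdict(list)
value_p = 0
for ts, value in rows:
    if value is not None:
        value_p += 1
    grouped[value_p].append((ts, value))

# Inclusive [min, max] range plus the group's single non-null value
# (the first row of each group, since the data starts with a reading).
ranges = {p: (g[0][0], g[-1][0], g[0][1]) for p, g in grouped.items()}

seen = set()
result = []
for t in test_dates:
    for p, (lo, hi, value) in ranges.items():
        if lo <= t <= hi:  # the <@ containment test
            result.append((t, value if p not in seen else None))
            seen.add(p)
            break
```

Running this reproduces the desired table: 12:09 stays null because its group's value was already handed out at 12:07.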
There's no need to use the delta column.
Update: realized I was pulling entries after the sample reference point when I should have been pulling the ones before it. Fixed.
Based on your table, and assuming you want timestamps rather than dates, this gets you what you want. Just change the minutes_between_intervals constant in the first table expression to spread out the samples. For readability, I've made the CTEs more verbose than they need to be:
WITH with_offsets AS (
  -- First add in some metadata about how many minutes have elapsed since you
  -- started sampling along with a constant for the sampling interval.
  SELECT
    2 AS minutes_between_intervals, -- This is how often you're sampling
    ts,
    value,
    delta,
    extract(minute FROM ts - (min(ts) OVER (ORDER BY ts)))::integer AS minutes_offset
  FROM Table1
), with_groups AS (
  -- Add grouping, setting the sample entries as reference points and the
  -- entries leading up to it as part of its group.
  SELECT
    *,
    CASE WHEN minutes_offset % minutes_between_intervals = 0 THEN minutes_offset
         ELSE minutes_offset + (minutes_between_intervals - (minutes_offset % minutes_between_intervals))
    END AS sample_group,
    minutes_offset % minutes_between_intervals = 0 AS is_sample_boundary
  FROM with_offsets
), with_arrays AS (
  -- Then aggregate them into arrays. The vals array has all NULLs
  -- removed. The groups with sample entries are marked.
  SELECT
    array_agg(ts ORDER BY ts) AS dates,
    array_agg(value ORDER BY ts) FILTER (WHERE value IS NOT NULL) AS vals,
    array_agg(delta ORDER BY ts) AS deltas,
    bool_or(is_sample_boundary) AS has_complete_sample
  FROM with_groups
  GROUP BY sample_group
)
-- Now take the last entry from each array, which will be the sample date,
-- the last recorded value, and the last recorded sample delta.
SELECT
  dates[array_upper(dates, 1)] AS ts,
  vals[array_upper(vals, 1)] AS value,
  deltas[array_upper(deltas, 1)] AS delta
FROM with_arrays
WHERE has_complete_sample;
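To see what the CTEs are doing, here is the same bucketing sketched in Python (my own names, with minute integers standing in for timestamps): each row's offset is rounded up to the next sampling boundary, and each bucket that contains a boundary row yields that row's date plus the last non-null value seen in the bucket.

```python
# Sketch of the interval-bucketing approach: round each offset up to the
# next sampling boundary, group by that boundary, then emit the last date
# and last non-null value from each boundary bucket.
interval = 2  # minutes_between_intervals
rows = [
    (1, 3.5), (2, 3.2), (3, None), (4, 2.9), (5, None), (6, 3.0),
    (7, None), (8, None), (9, None), (10, None), (11, 3.2), (12, None),
]

buckets = {}
start = rows[0][0]
for ts, value in rows:
    offset = ts - start
    # Round the offset up to the next multiple of the sampling interval.
    group = offset if offset % interval == 0 else offset + (interval - offset % interval)
    b = buckets.setdefault(group, {"dates": [], "vals": [], "boundary": False})
    b["dates"].append(ts)
    if value is not None:
        b["vals"].append(value)  # the FILTER (WHERE value IS NOT NULL) step
    b["boundary"] = b["boundary"] or offset % interval == 0

samples = [
    (b["dates"][-1], b["vals"][-1] if b["vals"] else None)
    for b in buckets.values() if b["boundary"]  # has_complete_sample
]
```

The trailing 12:12 row falls into a bucket with no boundary entry, so it is dropped, just as `WHERE has_complete_sample` drops it in the SQL.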
Mind-bending stuff. It works great for a small data set, but gets expensive as the data grows — e.g. pulling 500 rows out of 25,000 takes about 1.5 s, and that's after replacing the tsrange with the faster min/max and a JOIN ... ON test_date >= min ...

Yes, joining on those will be faster, provided you store the timestamps as actual timestamps, which I'd recommend. All that aggregation is certainly going to be expensive, but to recommend any further optimization we'd need to see EXPLAIN ANALYZE output in a new question.

They are actual timestamps; my original edit and fiddle were mistakenly set up with the timestamp column as varchar.

We don't know the sampling pattern in advance. It varies with the size of the raw data pool, and may not fall on even increments.