Can the PostgreSQL window function LAG refer to the column being computed?


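I need to compute the value of some column X from other columns of the current record and from the value of X in the previous record (under some partitioning and ordering). Basically, I need to implement a query of the form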

SELECT <some fields>, 
  <some expression using LAG(X) OVER(PARTITION BY ... ORDER BY ...)> AS X
FROM <table>

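I want to find "duplicate" events (in order to skip them). By duplicate I mean the following. Order all events of a given type by time_stamp ascending. Then:

- the first event is not a duplicate;
- all events that follow a non-duplicate event and fall within some time frame after it (that is, whose time_stamp is not greater than the time_stamp of the previous non-duplicate event plus some constant TIMEFRAME) are duplicates;
- the next event whose time_stamp exceeds that of the previous non-duplicate event by more than TIMEFRAME is not a duplicate;
- and so on.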

For this data

insert into event (type, time_stamp) 
 values 
  (1, 1), (1, 2), (2, 2), (1,3), (1, 10), (2,10), 
  (1,15), (1, 21), (2,13), 
  (1, 40);

and TIMEFRAME=10, the result should be

time_stamp | type | duplicate
-----------------------------
        1  |    1 | false
        2  |    1 | true     
        3  |    1 | true 
       10  |    1 | true 
       15  |    1 | false 
       21  |    1 | true
       40  |    1 | false
        2  |    2 | false
       10  |    2 | true
       13  |    2 | false
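
The question does not show the table definition. To try the queries locally, a table along the following lines could be created before running the INSERT above. This is only a sketch: the column types and the serial id column are assumptions (an id column does appear in the query output further down, and a serial key filled in insertion order would reproduce those id values).

```sql
-- Hypothetical definition of the event table; types are assumed, not taken from the question.
CREATE TABLE event (
    id         serial  PRIMARY KEY,  -- assumed surrogate key
    type       integer NOT NULL,     -- event type, used as the partition key
    time_stamp integer NOT NULL      -- integer ticks, as in the sample data
);
```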

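I could compute the value of the duplicate field from the current time_stamp and the time_stamp of the previous non-duplicate event, like this: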

WITH evt AS (
  SELECT 
    time_stamp, 
    CASE WHEN 
      time_stamp - LAG(current_non_dupl_time_stamp) OVER w >= TIMEFRAME
    THEN 
      time_stamp
    ELSE
      LAG(current_non_dupl_time_stamp) OVER w
    END AS current_non_dupl_time_stamp
  FROM event
  WINDOW w AS (PARTITION BY type ORDER BY time_stamp ASC)
)
SELECT time_stamp, time_stamp != current_non_dupl_time_stamp AS duplicate
FROM evt;

But this does not work, because the column being computed cannot be referenced inside LAG:

ERROR:  column "current_non_dupl_time_stamp" does not exist.
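
The error arises because a window function only sees the rows produced by the FROM clause of its own query level, so an output alias defined in the same SELECT list is not visible to it. What LAG can do is look at an existing column of the previous row, as in the sketch below (the aliases prev_time_stamp and gap_to_prev are made up for illustration). This is not enough for the rule above: for type 1, the event at time_stamp 15 is only 5 ticks after the previous row (10), yet it is 14 ticks after the last non-duplicate event (1), so it must not be flagged.

```sql
-- LAG over a real column: distance to the immediately preceding row,
-- which is NOT the same as the distance to the last non-duplicate row.
SELECT type,
       time_stamp,
       LAG(time_stamp) OVER w              AS prev_time_stamp,
       time_stamp - LAG(time_stamp) OVER w AS gap_to_prev
FROM event
WINDOW w AS (PARTITION BY type ORDER BY time_stamp)
ORDER BY type, time_stamp;
```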

So the question is: can I rewrite this query to achieve the effect I need?

This feels more like a recursion problem than a window function problem. The following query produces the desired result:

WITH RECURSIVE base(type, time_stamp, next_time_stamp) AS (

  -- 3. base of recursive query
  SELECT x.type, x.time_stamp, y.next_time_stamp
    FROM 
         -- 1. start with the initial records of each type   
         ( SELECT type, min(time_stamp) AS time_stamp
             FROM event
             GROUP BY type
         ) x
         LEFT JOIN LATERAL
         -- 2. for each of the initial records, find the next TIMEFRAME (10) in the future
         ( SELECT MIN(time_stamp) next_time_stamp
             FROM event
             WHERE type = x.type
               AND time_stamp > (x.time_stamp + 10)
         ) y ON true

  UNION ALL

  -- 4. recursive join, same logic as base
  SELECT e.type, e.time_stamp, z.next_time_stamp
    FROM event e
    JOIN base b ON (e.type = b.type AND e.time_stamp = b.next_time_stamp)
    LEFT JOIN LATERAL
    ( SELECT MIN(time_stamp) next_time_stamp
       FROM event
       WHERE type = e.type
         AND time_stamp > (e.time_stamp + 10)
    ) z ON true

)

-- The actual query:

-- 5a. All records from base are not duplicates
SELECT time_stamp, type, false AS duplicate
  FROM base

UNION

-- 5b. All records from event that are not in base are duplicates
SELECT time_stamp, type, true
  FROM event
  WHERE (type, time_stamp) NOT IN (SELECT type, time_stamp FROM base) 

ORDER BY type, time_stamp

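There are a number of caveats here. It assumes that there are no duplicate time_stamp values for a given type. Really, the joins should be based on a unique id rather than on type and time_stamp. I have not tested this much, but it may at least suggest an approach.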

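This is my first attempt at a LATERAL join, so there may be a way to simplify it further. What I actually wanted was a recursive CTE whose recursive part uses MIN(time_stamp) based on time_stamp > (x.time_stamp + 10), but aggregate functions are not allowed in a CTE in that way. It seems, though, that a lateral join can be used inside the CTE.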
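
To see what the lateral subquery contributes on its own, it can be run outside the recursion. The sketch below (the aliases e2 and nxt are illustrative) lists, for every event, the earliest later event of the same type that lies more than 10 ticks ahead. Because an aggregate without GROUP BY always returns exactly one row, next_time_stamp is simply NULL when no such event exists.

```sql
-- For each event, find the first event of the same type beyond the time frame.
SELECT e.type,
       e.time_stamp,
       nxt.next_time_stamp
FROM event e
LEFT JOIN LATERAL (
    SELECT min(e2.time_stamp) AS next_time_stamp
    FROM event e2
    WHERE e2.type = e.type
      AND e2.time_stamp > e.time_stamp + 10   -- 10 is TIMEFRAME
) nxt ON true
ORDER BY e.type, e.time_stamp;
```

The recursive query above follows only the chain of these next_time_stamp links that starts at the first event of each type.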

Naive recursive chain knitter:
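
One way such a chain could be knitted, sketched here under assumptions and not taken from the answer itself: number the rows of each type, then walk them one at a time while carrying the time_stamp of the last non-duplicate event. The names ordered, walk, rn and last_kept are hypothetical, the id column is assumed to exist as a tie-breaker, and 10 stands for TIMEFRAME.

```sql
WITH RECURSIVE ordered AS (
    -- Give every event a position within its type; id breaks ties.
    SELECT id, type, time_stamp,
           row_number() OVER (PARTITION BY type ORDER BY time_stamp, id) AS rn
    FROM event
),
walk AS (
    -- The first event of each type is never a duplicate.
    SELECT id, type, time_stamp, rn,
           time_stamp AS last_kept,
           false      AS duplicate
    FROM ordered
    WHERE rn = 1

    UNION ALL

    -- Step to the next row, carrying the last non-duplicate time_stamp.
    SELECT o.id, o.type, o.time_stamp, o.rn,
           CASE WHEN o.time_stamp > w.last_kept + 10
                THEN o.time_stamp
                ELSE w.last_kept
           END,
           o.time_stamp <= w.last_kept + 10
    FROM walk w
    JOIN ordered o
      ON o.type = w.type
     AND o.rn   = w.rn + 1
)
SELECT time_stamp, type, duplicate
FROM walk
ORDER BY type, time_stamp, id;
```

This visits every row once per recursion step, so it is simple rather than fast. It does not depend on time_stamp being unique within a type.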

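An alternative to the recursive approach is a custom aggregate. Once you have mastered the technique of writing your own aggregates, creating the transition and final functions becomes simple and logical.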
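
The mechanism this relies on can be seen with built-in aggregates first. When a window has an ORDER BY and no explicit frame, the default frame runs from the start of the partition to the current row (including any rows that tie with it in the ordering), so an aggregate behaves like a running computation. A custom aggregate used this way can carry the last non-duplicate time_stamp in its state. The aliases below are illustrative only.

```sql
-- Running aggregates: each row sees the state accumulated so far in its partition.
SELECT type,
       time_stamp,
       count(*)        OVER w AS rows_so_far,
       min(time_stamp) OVER w AS first_time_stamp
FROM event
WINDOW w AS (PARTITION BY type ORDER BY time_stamp)
ORDER BY type, time_stamp;
```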

State transition function:

create or replace function is_duplicate(st int[], time_stamp int, timeframe int)
returns int[] language plpgsql as $$
begin
    -- st[1]: time_stamp of the last non-duplicate event
    -- st[2]: time_stamp of the current event
    -- note: a gap of exactly timeframe starts a new non-duplicate event here
    if st is null or st[1] + timeframe <= time_stamp
    then 
        st[1] := time_stamp;
    end if;
    st[2] := time_stamp;
    return st;
end $$;

Final function:

create or replace function is_duplicate_final(st int[])
returns boolean language sql as $$
    -- a row is a duplicate when it is not itself the last non-duplicate event
    select st[1] <> st[2];
$$;

Aggregate:

create aggregate is_duplicate_agg(time_stamp int, timeframe int)
(
    sfunc = is_duplicate,
    stype = int[],
    finalfunc = is_duplicate_final
);

Query:

select *, is_duplicate_agg(time_stamp, 10) over w
from event
window w as (partition by type order by time_stamp asc)
order by type, time_stamp;

 id | type | time_stamp | is_duplicate_agg 
----+------+------------+------------------
  1 |    1 |          1 | f
  2 |    1 |          2 | t
  4 |    1 |          3 | t
  5 |    1 |         10 | t
  7 |    1 |         15 | f
  8 |    1 |         21 | t
 10 |    1 |         40 | f
  3 |    2 |          2 | f
  6 |    2 |         10 | t
  9 |    2 |         13 | f
(10 rows)

For details, see the PostgreSQL documentation on user-defined aggregates (CREATE AGGREGATE).

I can't understand the time frame part. In particular this part: "the next event whose time_stamp exceeds that of the previous non-duplicate event by more than TIMEFRAME is not a duplicate". Is the time frame a constant, a field, or a calculation?

timeframe is a constant. The rationale is that I want to skip an event if it happens within the given time frame after the previous event that was not skipped.

Your desired output contains time_stamp 40, but your sample data set doesn't? Can you clarify?

You are right, that was a mistake.
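
Two edge cases of is_duplicate_agg may be worth handling, depending on the data. First, it treats a gap of exactly timeframe as a new non-duplicate event, whereas the rule in the question counts it as a duplicate. Second, its final function compares time stamps, so an event sharing its time_stamp with a non-duplicate event of the same type is also reported as non-duplicate. The variant below is a sketch only: the is_duplicate_strict* names are hypothetical, an id column is assumed as a tie-breaker, and the state keeps an explicit flag in place of the current time_stamp.

```sql
-- State: st[1] = time_stamp of the last non-duplicate event,
--        st[2] = 1 if the current row is a duplicate, else 0.
create or replace function is_duplicate_strict(st int[], time_stamp int, timeframe int)
returns int[] language plpgsql as $$
begin
    if st is null or time_stamp > st[1] + timeframe then
        return array[time_stamp, 0];   -- new anchor, not a duplicate
    end if;
    return array[st[1], 1];            -- still within the anchor's time frame
end $$;

create or replace function is_duplicate_strict_final(st int[])
returns boolean language sql as $$
    select st[2] = 1;
$$;

create aggregate is_duplicate_strict_agg(time_stamp int, timeframe int)
(
    sfunc = is_duplicate_strict,
    stype = int[],
    finalfunc = is_duplicate_strict_final
);

-- Ordering by (time_stamp, id) with a ROWS frame makes each row see
-- exactly the rows before it, even when time stamps tie.
select id, type, time_stamp,
       is_duplicate_strict_agg(time_stamp, 10) over w as duplicate
from event
window w as (partition by type
             order by time_stamp, id
             rows between unbounded preceding and current row)
order by type, time_stamp, id;
```

On the sample data neither edge case occurs, so this variant should flag the same rows as is_duplicate_agg.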