SQL: "sessionising" an event stream

Tags: sql, session, amazon-redshift, event-stream

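I have a problem that should be solved outside of SQL, but due to business constraints it needs to be solved within SQL.

  • So, please don't tell me to do this at data ingestion, outside of SQL. I would like to, but it's not an option.

I have a stream of events with 4 principal properties:

  • The source device
  • The event's timestamp
  • The event's "type"
  • The event's "payload" (a dreaded VARCHAR representing various data types)

What I need to do is break the stream up into pieces (which I will refer to as "sessions").

  • Each session is specific to one device (effectively PARTITION BY device_id)
  • No session may contain more than one event of the same type

To shorten the examples, I'll limit them to just the timestamp and the event type.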

 timestamp | event_type          desired_session_id
-----------+------------        --------------------
     0     |     1                      0
     1     |     4                      0
     2     |     2                      0
     3     |     3                      0

     4     |     2                      1
     5     |     1                      1
     6     |     3                      1
     7     |     4                      1

     8     |     4                      2

     9     |     4                      3
    10     |     1                      3

    11     |     1                      4
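    12     |     2                      4

An idealised final output might pivot the final results:

device_id | session_id | event_type_1_timestamp | event_type_1_payload |  event_type_2_timestamp | event_type_2_payload ...
(That's not set in stone, though. I need to "know" which events make up a session, what their timestamps are and what their payloads are. Simply adding a session_id column to the input may well be enough, as long as I don't "lose" the other properties.)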


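There are:

  • 12 discrete event types
  • Hundreds of thousands of devices
  • Hundreds of thousands of events per device
  • A "norm" of around 6-8 events per "session"
  • But sometimes a session may have just 1 event, or all 12

These factors mean that semi-cartesian products and the like are, well, less than desirable, but may be "the only way".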


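I have toyed (in my head) with analytic functions and gaps-and-islands type approaches, but never quite got there. I always fall back to a place where I "want" some flags that I can carry forward from row to row and reset as needed.

Pseudocode that doesn't work in SQL:

flags = [0,0,0,0,0,0,0,0,0,0,0,0]      # one flag per event type
session_id = 0
for each row in stream
   if flags[row.event_id] == 0 then
       flags[row.event_id] = 1
   else
       session_id++
       flags = [0,0,0,0,0,0,0,0,0,0,0,0]
       flags[row.event_id] = 1          # the repeated event opens the new session
   row.session_id = session_id

Any SQL solution is appreciated, but you get "bonus points" if you can also account for events "happening at the same time".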
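
For reference, the row-by-row logic above is easy to express in a procedural language. The following is a minimal Python sketch, not SQL and therefore not a solution under the stated constraint. The function name assign_session_ids is made up for illustration, and it assumes the events of a single device are already ordered by timestamp and that a repeated event belongs to the new session, which is what the sample table implies.

```python
def assign_session_ids(event_types):
    """Return one session id per event for a single device's ordered stream."""
    seen = set()          # event types already present in the current session
    session_id = 0
    session_ids = []
    for event_type in event_types:
        if event_type in seen:
            # A repeated type closes the current session and opens a new one.
            session_id += 1
            seen = set()
        seen.add(event_type)
        session_ids.append(session_id)
    return session_ids


# Event types from the sample table, ordered by timestamp 0..12.
sample = [1, 4, 2, 3, 2, 1, 3, 4, 4, 4, 1, 1, 2]
print(assign_session_ids(sample))
# [0, 0, 0, 0, 1, 1, 1, 1, 2, 3, 3, 4, 4]
```

The output matches the desired_session_id column of the sample table.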


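I'm not 100% sure this can be done in SQL. But I have an idea for an algorithm that might work:

  • Enumerate the counts for each event
  • Take the maximum count up to each point as the "grouping" for the events (this is the session)

So: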
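
As a rough illustration only, here is one possible reading of these two steps, sketched in Python so it can be run anywhere. This is an interpretation of the idea, not the answer's own query, and the function name sessions_by_running_max is hypothetical. Each event is numbered within its type, and the running maximum of those numbers (minus one) is used as the session id.

```python
from collections import defaultdict


def sessions_by_running_max(event_types):
    """Session id = running maximum of the per-type occurrence count, minus 1."""
    counts = defaultdict(int)   # how many times each type has been seen so far
    running_max = 0
    result = []
    for event_type in event_types:
        counts[event_type] += 1
        running_max = max(running_max, counts[event_type])
        result.append(running_max - 1)
    return result


# Event types from the question's sample table, ordered by timestamp 0..12.
sample = [1, 4, 2, 3, 2, 1, 3, 4, 4, 4, 1, 1, 2]
print(sessions_by_running_max(sample))
# [0, 0, 0, 0, 1, 1, 1, 1, 2, 3, 3, 3, 3]
```

On the sample stream this agrees with desired_session_id up to timestamp 10, but gives 3 instead of 4 for timestamps 11 and 12, so under this reading the idea is only an approximation.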

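EDIT:

This is too long for a comment. I have a feeling that this requires a recursive CTE (RBAR). This is because you cannot land on a single row and look at the cumulative information or the neighbouring information to determine whether that row should start a new session.

Of course, there are some situations where it is obvious (say, the previous row has the same event). And it is also possible that there is some clever way of aggregating the previous data that makes it possible.

EDIT II:

I don't think this is possible without recursive CTEs (RBAR). This isn't quite a mathematical proof, but this is where my intuition comes from.

Imagine you are looking back 4 rows from the current row and you have: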

1
2
1
2
1  <-- current row

UPD based on the discussion (not checked or tested, just a rough idea):

where f_get_session_flag is:

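create or replace function f_get_session_flag(arr varchar(max))
returns boolean
stable as $$
# arr: comma-separated event type ids, ordered by time and ending with
# the current row. Assumes the ids are integers in the range 1..12.
stream = [int(x) for x in arr.split(',')]
flags = [0] * 13                 # index 0 is unused
is_new_session = False
for event_type in stream:
    if flags[event_type] == 0:
        flags[event_type] = 1
        is_new_session = False
    else:
        # repeated type: reset the flags and start a new session with this event
        flags = [0] * 13
        flags[event_type] = 1
        is_new_session = True
# True when the last event in the list starts a new session
return is_new_session
$$ language plpythonu;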

Answer:

These flags can be replicated as the remainder of dividing the running count of each event type by 2:

1 -> 1%2 = 1
2 -> 2%2 = 0
3 -> 3%2 = 1
4 -> 4%2 = 0
5 -> 5%2 = 1
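6 -> 6%2 = 0

and concatenated into a bitmask (similar to the flags array in the pseudocode). The only tricky bit is when exactly to reset all flags to zero and start a new session ID, but I can get very close. If your sample table is called t and has ts and type columns, the script could look like this: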

with
-- running count of the events
t1 as (
    select
     *
    ,sum(case when type=1 then 1 else 0 end) over (order by ts) as type_1_cnt
    ,sum(case when type=2 then 1 else 0 end) over (order by ts) as type_2_cnt
    ,sum(case when type=3 then 1 else 0 end) over (order by ts) as type_3_cnt
    ,sum(case when type=4 then 1 else 0 end) over (order by ts) as type_4_cnt
    from t
)
-- mask
,t2 as (
    select
     *
    ,case when type_1_cnt%2=0 then '0' else '1' end ||
     case when type_2_cnt%2=0 then '0' else '1' end ||
     case when type_3_cnt%2=0 then '0' else '1' end ||
     case when type_4_cnt%2=0 then '0' else '1' end as flags
    from t1
)
-- previous row's mask
,t3 as (
    select
     *
    ,lag(flags) over (order by ts) as flags_prev
    from t2
)
-- reset the mask if there is a switch from 1 to 0 at any position
,t4 as (
    select *
    ,case
        when (substring(flags from 1 for 1)='0' and substring(flags_prev from 1 for 1)='1')
        or (substring(flags from 2 for 1)='0' and substring(flags_prev from 2 for 1)='1')
        or (substring(flags from 3 for 1)='0' and substring(flags_prev from 3 for 1)='1')
        or (substring(flags from 4 for 1)='0' and substring(flags_prev from 4 for 1)='1')
        then '0000'
        else flags
     end as flags_override
    from t3
)
-- get the previous value of the reset mask and same event type flag for corner case 
,t5 as (
    select *
    ,lag(flags_override) over (order by ts) as flags_override_prev
    ,type=lag(type) over (order by ts) as same_event_type
    from t4
)
-- again, session ID is a switch from 1 to 0 OR same event type (that can be a switch from 0 to 1)
select
 ts
,type
,sum(case
 when (substring(flags_override from 1 for 1)='0' and substring(flags_override_prev from 1 for 1)='1')
        or (substring(flags_override from 2 for 1)='0' and substring(flags_override_prev from 2 for 1)='1')
        or (substring(flags_override from 3 for 1)='0' and substring(flags_override_prev from 3 for 1)='1')
        or (substring(flags_override from 4 for 1)='0' and substring(flags_override_prev from 4 for 1)='1')
        or same_event_type
        then 1
        else 0 end
 ) over (order by ts) as session_id
from t5
order by ts
;

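You can add the necessary partitions and extend this to 12 event types; this code is intended to work on the sample table you provided. It isn't perfect: if you run the subqueries you'll see that the flags are reset more often than needed. Overall it works, except for the corner case of session id 2, which consists of a single event of type 4 that follows another session ending with the same event type 4. So I added a simple lookup in same_event_type and used it as another condition for a new session id. Hopefully this works on a bigger dataset.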
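
To make the parity idea easier to inspect, here is a small Python sketch of what the t1 and t2 steps compute (running count per type, then count % 2 per position). It is only an illustration under the same assumptions as the query: four event types numbered 1 to 4 and rows ordered by ts. The helper name parity_masks is invented for this example.

```python
def parity_masks(event_types, n_types=4):
    """Return the flags bitmask after each event, as in the t2 step."""
    counts = [0] * n_types              # running count per event type
    masks = []
    for event_type in event_types:
        counts[event_type - 1] += 1
        # '1' where the running count is odd, '0' where it is even
        masks.append(''.join(str(c % 2) for c in counts))
    return masks


# Event types from the question's sample table, ordered by timestamp 0..12.
sample = [1, 4, 2, 3, 2, 1, 3, 4, 4, 4, 1, 1, 2]
for ts, mask in enumerate(parity_masks(sample)):
    print(ts, mask)
```

The first four rows build up to 1111. At timestamp 4 the second type-2 event flips its bit from 1 to 0 (mask 1011), which is the kind of switch the query treats as a session boundary.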

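The solution I decided to go with is to effectively "not do it in SQL", by deferring the actual sessionisation to a scalar function written in Python.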

--
-- The input parameter should be a comma delimited list of identifiers
-- Each identifier should be a "power of 2" value, no lower than 1
-- (1, 2, 4, 8, 16, 32, 64, 128, etc, etc)
--
-- The input '1,2,4,2,1,1,4' will give the output '0001010'
--
CREATE OR REPLACE FUNCTION public.f_indentify_collision_indexes(arr varchar(max))
RETURNS VARCHAR(MAX)
STABLE AS
$$
    stream = map(int, arr.split(','))
    state = 0
    collisions = []
    item_id = 1
    for item in stream:
        if (state & item) == (item):
            collisions.append('1')
            state = item
        else:
            state |= item
            collisions.append('0')
        item_id += 1

    return ''.join(collisions)
$$
LANGUAGE plpythonu;

Note: I wouldn't use this if there were hundreds of event types ;)
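
The body of the function can be tried outside the database. Below is a plain Python port of the same logic for local testing, plus a hypothetical helper, encode_event_types, that maps event type ids 1, 2, 3, ... to the powers of two the function expects. The helper and both Python function names are assumptions made for this sketch; how the ids are actually encoded is not stated above.

```python
def identify_collision_indexes(arr):
    """Return a string with '1' wherever an event starts a new session."""
    state = 0                 # bitmask of event types seen in the current session
    collisions = []
    for item in (int(x) for x in arr.split(',')):
        if (state & item) == item:
            # This type is already in the session: start a new one with it.
            collisions.append('1')
            state = item
        else:
            state |= item
            collisions.append('0')
    return ''.join(collisions)


def encode_event_types(event_type_ids):
    """Map ids 1, 2, 3, ... to 1, 2, 4, ... (assumed encoding)."""
    return ','.join(str(1 << (i - 1)) for i in event_type_ids)


print(identify_collision_indexes('1,2,4,2,1,1,4'))   # 0001010
print(encode_event_types([1, 4, 2, 3]))              # 1,8,2,4
```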


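Effectively, I pass in a data structure of the events, in sequence, and what comes back is a data structure of where the new sessions start.

I chose the actual data structures so as to make the SQL side of things as simple as possible. (They might not be the best; I'm very open to other ideas.)

  • Assert a deterministic order for the events (covering simultaneous events, etc.)
    ROW_NUMBER() OVER (stuff) AS session_event_sequence_id

  • Create a comma-delimited list of event_type_ids
    LISTAGG(event_type_id, ',')
    =>
    '1,2,4,8,2,1,4,1,4,4,1,1'

  • Use Python to work out the boundaries
    public.f_magic('1,2,4,8,2,1,4,1,4,4,1,1')
    =>
    '000010010101'

  • For the first event in the sequence, count the number of 1s up to and including the first character of the boundaries. For the second event in the sequence, count the number of 1s up to and including the second character of the boundaries, and so on.
    Event 01 = 1
    =>
    boundaries = '0'
    =>
    session_id = 0

    Event 02 = 2
    =>
    boundaries = '00'
    =>
    session_id = 0

    Event 03 = 4
    =>
    boundaries = '000'
    =>
    session_id = 0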
    Event 04 = 8
    
    INSERT INTO
        sessionised_event_stream
    SELECT
        device_id,
        REGEXP_COUNT(
            LEFT(
                public.f_indentify_collision_indexes(
                    LISTAGG(event_type_id, ',')
                        WITHIN GROUP (ORDER BY session_event_sequence_id)
                        OVER (PARTITION BY device_id)
                ),
                session_event_sequence_id::INT
            ),
            '1',
            1
        ) + 1
            AS session_login_attempt_id,
        session_event_sequence_id,
        event_timestamp,
        event_type_id,
        event_data
    FROM
    (
        SELECT
            *,
            ROW_NUMBER()
                OVER (PARTITION BY device_id
                          ORDER BY event_timestamp, event_type_id, event_data)
                    AS session_event_sequence_id
        FROM
            event_stream
    )
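
The prefix-counting step can be checked in isolation. The Python sketch below mimics what REGEXP_COUNT(LEFT(boundaries, n), '1') does for n = 1, 2, 3, ... on a boundaries string. The function name session_ids_from_boundaries and the one_based switch are invented for this example; the switch reflects that the walkthrough above counts sessions from 0 while the query adds 1.

```python
def session_ids_from_boundaries(boundaries, one_based=False):
    """Count the '1's in each prefix of the boundaries string."""
    ids = []
    for n in range(1, len(boundaries) + 1):
        count = boundaries[:n].count('1')      # like REGEXP_COUNT(LEFT(b, n), '1')
        ids.append(count + 1 if one_based else count)
    return ids


print(session_ids_from_boundaries('000010010101'))
# [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4]
```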
    
    Write some SQL that can find "the next session" from any given stream.
    
    Run that SQL once storing the results in a temp table.
    => Now have the first session from every stream
    
    Run it again using the temp table as an input
    => We now also have the second session from every stream
    
    Keep repeating this until the SQL inserts 0 rows in to the temp table
    => We now have all the sessions from every stream
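
    The control flow of that loop can be sketched in Python. This is only a model of the idea, with an in-memory dict standing in for the temp table; the names next_session_length and sessionise_iteratively are hypothetical, and a real implementation would express the "next session" step as SQL.

```python
def next_session_length(event_types, start):
    """Length of the longest run from `start` with no repeated event type."""
    seen = set()
    end = start
    while end < len(event_types) and event_types[end] not in seen:
        seen.add(event_types[end])
        end += 1
    return end - start


def sessionise_iteratively(streams):
    """Peel one session per stream per pass until a pass inserts nothing."""
    sessions = {device: [] for device in streams}   # stands in for the temp table
    while True:
        inserted = 0
        for device, events in streams.items():
            start = sum(sessions[device])           # events already assigned
            if start < len(events):
                sessions[device].append(next_session_length(events, start))
                inserted += 1
        if inserted == 0:
            return sessions


streams = {'a': [1, 4, 2, 3, 2, 1, 3, 4, 4, 4, 1, 1, 2], 'b': [1, 1]}
print(sessionise_iteratively(streams))
# {'a': [4, 4, 1, 2, 2], 'b': [1, 1]}
```

    The number of passes equals the largest number of sessions in any one stream, plus one final pass that inserts nothing.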