SQL: "sessionising" an event stream
Tags: sql, session, amazon-redshift, event-stream
I have a problem that should be solved outside of SQL, but due to business constraints it needs to be solved within SQL.
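- So please don't tell me to do this at data ingestion, outside of SQL. I would like to, but it's not an option.
I have a stream of events, with 4 principal properties:
- The source device
- The event's timestamp
- The event's "type"
- The event's "payload" (a dreaded VARCHAR representing various data types)
What I need to do is break the stream up into pieces (which I will refer to as "sessions").
- Each session is specific to one device (effectively, partition by device id)
- No session may contain more than one event of the same type
To shorten the examples, I'll limit them to include only the timestamp and the event type.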
 timestamp | event_type | desired_session_id
-----------+------------+--------------------
         0 |          1 |                  0
         1 |          4 |                  0
         2 |          2 |                  0
         3 |          3 |                  0
         4 |          2 |                  1
         5 |          1 |                  1
         6 |          3 |                  1
         7 |          4 |                  1
         8 |          4 |                  2
         9 |          4 |                  3
        10 |          1 |                  3
        11 |          1 |                  4
        12 |          2 |                  4
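The idealised final output might be the final results pivoted:
device_id | session_id | event_type_1_timestamp | event_type_1_payload | event_type_2_timestamp | event_type_2_payload ...
(That is not set in stone, though. What I do need is to "know" which events make up a session, what their timestamps are, and what their payloads are. Just appending a session_id column to the input may be enough, as long as I don't "lose" the other properties.)
There are:
- 12 discrete event types
- Hundreds of thousands of devices
- Hundreds of thousands of events per device
- A "norm" of around 6-8 events per "session"
- But sometimes a session may have just 1 event, or all 12
I've played (in my head) with analytic functions and gaps-and-islands type approaches, but never quite got there. I always fall back to a place where I "want" some flags that I can carry forward from row to row and reset as needed.
Pseudocode that doesn't work in SQL: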
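flags = [0,0,0,0,0,0,0,0,0,0,0,0]      # one flag per event type
session_id = 0
for each row in stream
    if flags[row.event_id] == 0 then
        flags[row.event_id] = 1
    else
        session_id++
        flags = [0,0,0,0,0,0,0,0,0,0,0,0]
        flags[row.event_id] = 1        # the repeated event opens the new session
    row.session_id = session_id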
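For reference, here is a minimal runnable Python sketch of the flag-and-reset idea in the pseudocode, checked against the sample table. The function name `assign_session_ids` is made up for this illustration, and a set stands in for the flags array. The sketch assumes that the event which triggers a new session is itself recorded as "seen" in that new session, which is what the desired_session_id column implies (timestamps 8 and 9 land in different sessions).

```python
def assign_session_ids(event_types):
    """Return one session id per event; a repeated type starts a new session."""
    seen = set()          # plays the role of the flags array
    session_id = 0
    ids = []
    for event_type in event_types:
        if event_type in seen:
            # Duplicate type: close the current session and start a new one.
            session_id += 1
            seen = set()
        seen.add(event_type)   # the current event belongs to the (possibly new) session
        ids.append(session_id)
    return ids


# Event types from the sample table, ordered by timestamp 0..12.
sample_types = [1, 4, 2, 3, 2, 1, 3, 4, 4, 4, 1, 1, 2]
desired = [0, 0, 0, 0, 1, 1, 1, 1, 2, 3, 3, 4, 4]
print(assign_session_ids(sample_types) == desired)
```

The loop carries state from one row to the next, which is exactly the part that has no direct equivalent in a single window function.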
Any SQL solution is appreciated, but you get "bonus points" if you can also account for events "happening at the same time".
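I'm not 100% sure this can be done in SQL. But I have an idea for an algorithm that might work:
- Enumerate the running count for each event type
- Take the maximum count up to each point as the "grouping" for the events (this is the session)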
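One way to read the two steps above is a per-type occurrence counter combined with a running maximum. The Python sketch below applies that reading to the sample stream from the question; the function name `running_max_sessions` is hypothetical and the interpretation itself is an assumption. On this data it agrees with desired_session_id for timestamps 0 to 10 but not for the last two rows, which fits the author's own doubt about the approach.

```python
from collections import defaultdict


def running_max_sessions(event_types):
    """Session id = (running max of per-type occurrence counts) - 1."""
    occurrences = defaultdict(int)   # how many times each type has been seen so far
    running_max = 0
    ids = []
    for event_type in event_types:
        occurrences[event_type] += 1
        running_max = max(running_max, occurrences[event_type])
        ids.append(running_max - 1)
    return ids


sample_types = [1, 4, 2, 3, 2, 1, 3, 4, 4, 4, 1, 1, 2]
desired = [0, 0, 0, 0, 1, 1, 1, 1, 2, 3, 3, 4, 4]
result = running_max_sessions(sample_types)
print(result)
print([i for i, (got, want) in enumerate(zip(result, desired)) if got != want])
```

The mismatch appears because the counts are never reset at a session boundary, so a type that was repeated heavily earlier (type 4 here) masks later repeats of other types.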
1
2
1
2
1 <-- current row
1
2.
1.
2.
1
UPD based on the discussion (not checked/tested, just a rough idea), where f_get_session_flag is:
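create or replace function f_get_session_flag(arr varchar(max))
returns boolean
stable as $$
    # arr: comma-separated event type ids (1..12), oldest first, ending at the current row
    stream = [int(x) for x in arr.split(',')]
    flags = [0] * 13                  # index 0 is unused
    is_new_session = False
    for event_id in stream:
        if flags[event_id] == 0:
            flags[event_id] = 1
            is_new_session = False
        else:
            # repeated type: this event starts a new session
            flags = [0] * 13
            flags[event_id] = 1
            is_new_session = True
    # True only if the last (current) event opened a new session
    return is_new_session
$$ language plpythonu;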
Answer:
The flags can be replicated as the remainder of dividing each event type's running count by 2:
1 -> 1%2 = 1
2 -> 2%2 = 0
3 -> 3%2 = 1
4 -> 4%2 = 0
5 -> 5%2 = 1
6 -> 6%2 = 0
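and concatenated into a bitmask (similar to the flags array in the pseudocode). The only tricky part is deciding exactly when to reset all flags to zero and start a new session ID, but I can get quite close. If your sample table is called t and has ts and type columns, the script could look like this: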
with
-- running count of the events
t1 as (
    select
        *
        ,sum(case when type=1 then 1 else 0 end) over (order by ts) as type_1_cnt
        ,sum(case when type=2 then 1 else 0 end) over (order by ts) as type_2_cnt
        ,sum(case when type=3 then 1 else 0 end) over (order by ts) as type_3_cnt
        ,sum(case when type=4 then 1 else 0 end) over (order by ts) as type_4_cnt
    from t
)
-- mask
,t2 as (
    select
        *
        ,case when type_1_cnt%2=0 then '0' else '1' end ||
         case when type_2_cnt%2=0 then '0' else '1' end ||
         case when type_3_cnt%2=0 then '0' else '1' end ||
         case when type_4_cnt%2=0 then '0' else '1' end as flags
    from t1
)
-- previous row's mask
,t3 as (
    select
        *
        ,lag(flags) over (order by ts) as flags_prev
    from t2
)
-- reset the mask if there is a switch from 1 to 0 at any position
,t4 as (
    select *
        ,case
            when (substring(flags from 1 for 1)='0' and substring(flags_prev from 1 for 1)='1')
              or (substring(flags from 2 for 1)='0' and substring(flags_prev from 2 for 1)='1')
              or (substring(flags from 3 for 1)='0' and substring(flags_prev from 3 for 1)='1')
              or (substring(flags from 4 for 1)='0' and substring(flags_prev from 4 for 1)='1')
            then '0000'
            else flags
         end as flags_override
    from t3
)
-- get the previous value of the reset mask and same event type flag for corner case
,t5 as (
    select *
        ,lag(flags_override) over (order by ts) as flags_override_prev
        ,type=lag(type) over (order by ts) as same_event_type
    from t4
)
-- again, session ID is a switch from 1 to 0 OR same event type (that can be a switch from 0 to 1)
select
    ts
    ,type
    ,sum(case
            when (substring(flags_override from 1 for 1)='0' and substring(flags_override_prev from 1 for 1)='1')
              or (substring(flags_override from 2 for 1)='0' and substring(flags_override_prev from 2 for 1)='1')
              or (substring(flags_override from 3 for 1)='0' and substring(flags_override_prev from 3 for 1)='1')
              or (substring(flags_override from 4 for 1)='0' and substring(flags_override_prev from 4 for 1)='1')
              or same_event_type
            then 1
            else 0 end
     ) over (order by ts) as session_id
from t5
order by ts
;
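To make the intermediate columns easier to follow, here is a rough Python trace of the same steps on the sample table: the parity mask (flags in t2), the reset mask (flags_override in t4) and the final running sum. It is a sketch, not a replacement for the query. The function name `parity_mask_sessions` is hypothetical, and the SQL NULLs on the first row are modelled as "condition not met".

```python
def parity_mask_sessions(event_types, n_types=4):
    """Trace flags, flags_override and session_id row by row."""
    def drops(current, previous):
        # True if any position switches from '1' (previous row) to '0' (current row).
        if previous is None:
            return False
        return any(p == '1' and c == '0' for c, p in zip(current, previous))

    counts = [0] * n_types
    prev_flags = prev_override = prev_type = None
    session_id = 0
    rows = []
    for event_type in event_types:
        counts[event_type - 1] += 1
        flags = ''.join(str(c % 2) for c in counts)                       # t2
        override = '0' * n_types if drops(flags, prev_flags) else flags   # t4
        # Final select: a 1 -> 0 switch in the reset mask, or the same type twice in a row.
        if drops(override, prev_override) or event_type == prev_type:
            session_id += 1
        rows.append((flags, override, session_id))
        prev_flags, prev_override, prev_type = flags, override, event_type
    return rows


sample_types = [1, 4, 2, 3, 2, 1, 3, 4, 4, 4, 1, 1, 2]
trace = parity_mask_sessions(sample_types)
for ts, (flags, override, session_id) in enumerate(trace):
    print(ts, sample_types[ts], flags, override, session_id)
```

On the sample stream the traced session ids match the desired_session_id column. That says nothing about larger or differently shaped streams.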
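You can add the necessary partitions and extend this to 12 event types; this code is intended to work on the sample table you provided. It isn't perfect: if you run the subqueries you'll see that the flags are reset more often than needed. Overall it works, except for the corner case of session id 2, which holds a single event of type 4 that follows another session ending with the same event type 4. For that case I added a simple lookup of the previous row's event type in same_event_type and used it as another condition for a new session id. Hopefully this works on a bigger dataset.

The solution I decided to go with is to effectively "not do it in SQL", by deferring the actual sessionisation to a scalar function written in Python.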
--
-- The input parameter should be a comma delimited list of identifiers
-- Each identifier should be a "power of 2" value, no lower than 1
--   (1, 2, 4, 8, 16, 32, 64, 128, etc, etc)
--
-- The input '1,2,4,2,1,1,4' will give the output '0001010'
--
CREATE OR REPLACE FUNCTION public.f_indentify_collision_indexes(arr varchar(max))
RETURNS VARCHAR(MAX)
STABLE AS
$$
    stream = map(int, arr.split(','))
    state = 0          # bitmask of the event types seen in the current session
    collisions = []
    for item in stream:
        if (state & item) == (item):
            # type already seen: mark a boundary and start a new session with it
            collisions.append('1')
            state = item
        else:
            state |= item
            collisions.append('0')
    return ''.join(collisions)
$$
LANGUAGE plpythonu;
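The same logic can be tried outside Redshift. Below is a plain Python 3 port for experimenting; the names `identify_collision_indexes` and `to_bit` are made up here. The mapping from event type ids 1..12 to powers of two is an assumption about how the ids might be encoded, since the function only says that its inputs must already be powers of two.

```python
def identify_collision_indexes(arr):
    """Return a string with '1' wherever an event repeats a type already in the session."""
    state = 0                      # bitmask of types seen in the current session
    collisions = []
    for item in (int(x) for x in arr.split(',')):
        if state & item == item:
            collisions.append('1')
            state = item           # the repeated event opens the next session
        else:
            state |= item
            collisions.append('0')
    return ''.join(collisions)


def to_bit(event_type):
    """Hypothetical encoding: event type n (1-based) becomes 2 ** (n - 1)."""
    return 1 << (event_type - 1)


# The example from the function's own comment.
print(identify_collision_indexes('1,2,4,2,1,1,4'))

# The sample table from the question, encoded as powers of two.
sample_types = [1, 4, 2, 3, 2, 1, 3, 4, 4, 4, 1, 1, 2]
encoded = ','.join(str(to_bit(t)) for t in sample_types)
print(encoded)
print(identify_collision_indexes(encoded))
```

A value of 0 would always count as a collision, because `state & 0 == 0` is true for any state. That is presumably why the identifiers must be no lower than 1.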
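Note: I wouldn't use this if there were hundreds of event types ;)
Effectively, I pass in a data structure of the events in sequence, and what comes back is a data structure of where the new sessions start.
I chose the actual data structures to make the SQL side of things as simple as possible. (It might not be the best choice, and I'm very open to other ideas.)
Assert a deterministic order for the events (to cover events happening at the same time, etc.)
    ROW_NUMBER() OVER (stuff) AS session_event_sequence_id
Create a comma-delimited list of event_type_ids
    LISTAGG(event_type_id, ',')
    => '1,2,4,8,2,1,4,1,4,4,1,1'
Use Python to work out the boundaries
    public.f_magic('1,2,4,8,2,1,4,1,4,4,1,1')
    => '000010010101'
For the first event in the sequence, count the number of 1s up to and including the first character of the boundaries. For the second event in the sequence, count the number of 1s up to and including the second character, and so on.
Event 01 = 1
    => boundaries = '0'
    => session_id = 0
Event 02 = 2
    => boundaries = '00'
    => session_id = 0
Event 03 = 4
    => boundaries = '000'
    => session_id = 0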
Event 04 = 8
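The counting step can be sketched in a few lines of Python: for each position, count the 1s in the prefix of the boundaries string that ends at that position. The function name `session_ids_from_boundaries` is made up for illustration.

```python
def session_ids_from_boundaries(boundaries):
    """Session id of event i = number of '1' characters in boundaries[0..i]."""
    return [boundaries[:i + 1].count('1') for i in range(len(boundaries))]


events = '1,2,4,8,2,1,4,1,4,4,1,1'.split(',')
ids = session_ids_from_boundaries('000010010101')
for position, (event, session_id) in enumerate(zip(events, ids), start=1):
    print('Event %02d = %s => session_id = %d' % (position, event, session_id))
```

This re-counts each prefix, so it is quadratic in the length of the string. That is fine for a demonstration; a single running total would do the same work in one pass.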
INSERT INTO
    sessionised_event_stream
SELECT
    device_id,
    -- number of boundaries up to and including this event, plus 1
    REGEXP_COUNT(
        LEFT(
            public.f_indentify_collision_indexes(
                LISTAGG(event_type_id, ',')
                    WITHIN GROUP (ORDER BY session_event_sequence_id)
                    OVER (PARTITION BY device_id)
            ),
            session_event_sequence_id::INT
        ),
        '1',
        1
    ) + 1
        AS session_login_attempt_id,
    session_event_sequence_id,
    event_timestamp,
    event_type_id,
    event_data
FROM
(
    SELECT
        *,
        ROW_NUMBER()
            OVER (PARTITION BY device_id
                      ORDER BY event_timestamp, event_type_id, event_data)
            AS session_event_sequence_id
    FROM
        event_stream
) AS sequenced_event_stream   -- alias added; derived tables generally need one
;
Write some SQL that can find "the next session" from any given stream.
Run that SQL once storing the results in a temp table.
=> Now have the first session from every stream
Run it again using the temp table as an input
=> We now also have the second session from every stream
Keep repeating this until the SQL inserts 0 rows into the temp table
=> We now have all the sessions from every stream
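The loop described above can be mimicked in Python to check the logic before writing the SQL. This is only a sketch under assumptions: `next_session_length` stands in for the "find the next session" query, the dict of lists stands in for the event stream and temp table, and all names are hypothetical.

```python
def next_session_length(event_types):
    """Length of the longest prefix that contains no repeated event type."""
    seen = set()
    for index, event_type in enumerate(event_types):
        if event_type in seen:
            return index
        seen.add(event_type)
    return len(event_types)


def sessionise_iteratively(streams):
    """streams maps device_id -> event types in order; returns device_id -> list of sessions."""
    sessions = {device: [] for device in streams}
    offsets = {device: 0 for device in streams}   # how far each stream has been consumed
    while True:
        inserted = 0                              # "rows inserted" in this pass
        for device, events in streams.items():
            remaining = events[offsets[device]:]
            if not remaining:
                continue
            length = next_session_length(remaining)
            sessions[device].append(remaining[:length])
            offsets[device] += length
            inserted += 1
        if inserted == 0:                         # nothing left in any stream: stop
            return sessions


streams = {'device_a': [1, 4, 2, 3, 2, 1, 3, 4, 4, 4, 1, 1, 2], 'device_b': [2, 2]}
result = sessionise_iteratively(streams)
print(result['device_a'])
print(result['device_b'])
```

Each pass of the `while` loop adds at most one session per device, so the number of passes equals the largest number of sessions found on any single device.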