SQL: How to efficiently classify records in a subgroup based on other values in the subgroup?
I have a table like the one below. Records are sorted by User_ID and Event_Time.
Row User_ID Event_Time Event_Type
1 1 2020-01-01 View
2 1 2020-01-02 Click
3 1 2020-01-03 Purchase
4 2 2020-02-01 View
5 2 2020-02-02 Click
6 2 2020-02-03 View
7 2 2020-02-04 Purchase
8 2 2020-02-11 View
9 2 2020-02-12 Purchase
10 2 2020-02-21 View
11 2 2020-02-22 Click
12 2 2020-02-23 Purchase
13 2 2020-02-27 View
14 2 2020-02-28 Click
15 3 2020-03-01 View
16 3 2020-03-02 Purchase
...
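I want to add a new column called Path to classify the non-purchase events. Each non-purchase event of a user "belongs" to the next purchase event of the same user, i.e. the first purchase that occurs after it, so a purchase and the events leading up to it can be treated as a subgroup. Within each subgroup:
- The first non-purchase event is the Introducer (rows 1, 4, 10)
- The last non-purchase event is the Closer (rows 2, 6, 11)
- All non-purchase events between the Introducer and the Closer are Influencers (row 5)
- If a purchase event has only one non-purchase event grouped with it, that event is Only (rows 8, 15)
- Purchase events get NULL (rows 3, 7, 9, 12, 16)
- Non-purchase events that do not belong to any purchase event get NULL (rows 13, 14)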
Row User_ID Event_Time Event_Type Path
1 1 2020-01-01 View Introducer
2 1 2020-01-02 Click Closer
3 1 2020-01-03 Purchase NULL
4 2 2020-02-01 View Introducer
5 2 2020-02-02 Click Influencer
6 2 2020-02-03 View Closer
7 2 2020-02-04 Purchase NULL
8 2 2020-02-11 View Only
9 2 2020-02-12 Purchase NULL
10 2 2020-02-21 View Introducer
11 2 2020-02-22 Click Closer
12 2 2020-02-23 Purchase NULL
13 2 2020-02-27 View NULL
14 2 2020-02-28 Click NULL
15 3 2020-03-01 View Only
16 3 2020-03-02 Purchase NULL
...
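The rules above can be pinned down as a small reference implementation, which is handy for checking any SQL answer against the expected output. This is only a sketch in Python for the 16 sample rows, not a solution for 100 million records; the function name `classify_paths` and the tuple layout are assumptions made for illustration, and the input is assumed to be sorted by user and time as in the table.

```python
from itertools import groupby

# Sample rows from the table above: (Row, User_ID, Event_Time, Event_Type).
EVENTS = [
    (1, 1, "2020-01-01", "View"), (2, 1, "2020-01-02", "Click"),
    (3, 1, "2020-01-03", "Purchase"), (4, 2, "2020-02-01", "View"),
    (5, 2, "2020-02-02", "Click"), (6, 2, "2020-02-03", "View"),
    (7, 2, "2020-02-04", "Purchase"), (8, 2, "2020-02-11", "View"),
    (9, 2, "2020-02-12", "Purchase"), (10, 2, "2020-02-21", "View"),
    (11, 2, "2020-02-22", "Click"), (12, 2, "2020-02-23", "Purchase"),
    (13, 2, "2020-02-27", "View"), (14, 2, "2020-02-28", "Click"),
    (15, 3, "2020-03-01", "View"), (16, 3, "2020-03-02", "Purchase"),
]


def classify_paths(events):
    """Return {Row: Path or None}; events must be sorted by user, then time."""
    paths = {}
    for _, user_events in groupby(events, key=lambda e: e[1]):
        pending = []  # non-purchase rows still waiting for a purchase
        for row, _user, _time, event_type in user_events:
            paths[row] = None  # default: purchases and orphaned events
            if event_type != "Purchase":
                pending.append(row)
                continue
            # A purchase closes the current subgroup.
            if len(pending) == 1:
                paths[pending[0]] = "Only"
            elif len(pending) > 1:
                paths[pending[0]] = "Introducer"
                paths[pending[-1]] = "Closer"
                for middle in pending[1:-1]:
                    paths[middle] = "Influencer"
            pending = []
    return paths


if __name__ == "__main__":
    for row, path in classify_paths(EVENTS).items():
        print(row, path)
```

Events left in `pending` when a user's rows run out never see a purchase, which is why rows 13 and 14 keep the default of None.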
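If I do a self join and add a helper column that identifies, for each event, the time of the purchase it belongs to, the solution is straightforward. However, I have more than 100 million records and the self join is not efficient enough: the execution eventually times out. So my question is, is there a more efficient way to add this new column? I am considering a correlated subquery, but I am stuck.

If you are using a DBMS that supports window functions, you can use two CTEs to first split the rows into distinct purchases, then find the row numbers relative to each purchase, and finally compute Path according to the criteria you gave: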
WITH purchases AS (
    -- pnum counts the purchases strictly before each row, so a purchase
    -- shares its pnum with the events that lead up to it
    SELECT "Row", User_ID, Event_Time, Event_Type,
           COALESCE(SUM(CASE WHEN Event_Type = 'Purchase' THEN 1 ELSE 0 END) OVER
                    (PARTITION BY User_ID ORDER BY Event_Time
                     ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS pnum
    FROM events
),
prows AS (
    -- rn counts from the start of each group, drn from its end
    SELECT "Row", User_ID, Event_Time, Event_Type, pnum,
           ROW_NUMBER() OVER (PARTITION BY User_ID, pnum ORDER BY Event_Time) AS rn,
           ROW_NUMBER() OVER (PARTITION BY User_ID, pnum ORDER BY Event_Time DESC) AS drn
    FROM purchases
)
SELECT "Row", User_ID, Event_Time, Event_Type,
       -- NULL for purchases and for groups that contain no purchase
       CASE WHEN Event_Type = 'Purchase' OR
                 NOT EXISTS (SELECT *
                             FROM prows r2
                             WHERE r2.User_ID = r1.User_ID
                               AND r2.pnum = r1.pnum
                               AND r2.Event_Type = 'Purchase') THEN NULL
            -- drn = 1 is the purchase itself, so drn = 2 is the last non-purchase event
            WHEN rn = 1 AND drn = 2 THEN 'Only'
            WHEN rn = 1 THEN 'Introducer'
            WHEN drn = 2 THEN 'Closer'
            ELSE 'Influencer'
       END AS Path
FROM prows r1
ORDER BY User_ID, Event_Time
Output:
Row User_ID Event_Time Event_Type Path
1 1 2020-01-01 View Introducer
2 1 2020-01-02 Click Closer
3 1 2020-01-03 Purchase (null)
4 2 2020-02-01 View Introducer
5 2 2020-02-02 Click Influencer
6 2 2020-02-03 View Closer
7 2 2020-02-04 Purchase (null)
8 2 2020-02-11 View Only
9 2 2020-02-12 Purchase (null)
10 2 2020-02-21 View Introducer
11 2 2020-02-22 Click Closer
12 2 2020-02-23 Purchase (null)
13 2 2020-02-27 View (null)
14 2 2020-02-28 Click (null)
15 3 2020-03-01 View Only
16 3 2020-03-02 Purchase (null)
The same query will also run on PostgreSQL and Oracle.

This is similar to Nick's approach, but I think the logic is simpler:
WITH e AS (
    SELECT e.*,
           SUM(CASE WHEN Event_Type = 'Purchase' THEN 1 ELSE 0 END) OVER
               (PARTITION BY User_ID ORDER BY Event_Time DESC) AS grp
    FROM events e
),
en AS (
    SELECT e.*,
           COUNT(*) OVER (PARTITION BY user_id, grp) AS cnt,
           ROW_NUMBER() OVER (PARTITION BY user_id, grp ORDER BY Event_Time) AS seqnum
    FROM e
)
SELECT en.*,
       (CASE WHEN grp = 0                  -- no purchase event
             THEN NULL
             WHEN Event_Type = 'Purchase'  -- the event itself
             THEN NULL
             WHEN seqnum = 1 AND cnt = 2   -- the special case of "ONLY"
             THEN 'Only'
             WHEN seqnum = 1               -- The first event
             THEN 'Introducer'
             WHEN seqnum = cnt - 1         -- The penultimate event
             THEN 'Closer'
             ELSE 'Influencer'
        END) AS Path
FROM en
ORDER BY User_ID, Event_Time;
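In particular, the subquery in the outer query is unnecessary. Because the purchases are counted from the latest event backwards, grp = 0 identifies a user's trailing events that are not followed by any purchase. I also think the logic is easier to write in terms of the total number of events in a group and a sequential counter.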
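To see the backwards grouping at work, the query above can be run on the sample rows from the question. The sketch below does that with SQLite through Python's built-in `sqlite3` module, which is purely an assumption for local testing (the asker is on BigQuery, where identifier quoting and types differ); it needs SQLite 3.25 or later for window functions. The function name `run_reverse_grouping` and the column types are hypothetical, and the inner table alias is renamed to `ev` only for readability.

```python
import sqlite3

# Sample rows from the question: (Row, User_ID, Event_Time, Event_Type).
ROWS = [
    (1, 1, "2020-01-01", "View"), (2, 1, "2020-01-02", "Click"),
    (3, 1, "2020-01-03", "Purchase"), (4, 2, "2020-02-01", "View"),
    (5, 2, "2020-02-02", "Click"), (6, 2, "2020-02-03", "View"),
    (7, 2, "2020-02-04", "Purchase"), (8, 2, "2020-02-11", "View"),
    (9, 2, "2020-02-12", "Purchase"), (10, 2, "2020-02-21", "View"),
    (11, 2, "2020-02-22", "Click"), (12, 2, "2020-02-23", "Purchase"),
    (13, 2, "2020-02-27", "View"), (14, 2, "2020-02-28", "Click"),
    (15, 3, "2020-03-01", "View"), (16, 3, "2020-03-02", "Purchase"),
]

QUERY = """
WITH e AS (
    SELECT ev.*,
           SUM(CASE WHEN Event_Type = 'Purchase' THEN 1 ELSE 0 END)
               OVER (PARTITION BY User_ID ORDER BY Event_Time DESC) AS grp
    FROM events ev
),
en AS (
    SELECT e.*,
           COUNT(*) OVER (PARTITION BY User_ID, grp) AS cnt,
           ROW_NUMBER() OVER (PARTITION BY User_ID, grp
                              ORDER BY Event_Time) AS seqnum
    FROM e
)
SELECT "Row", grp,
       CASE WHEN grp = 0 THEN NULL
            WHEN Event_Type = 'Purchase' THEN NULL
            WHEN seqnum = 1 AND cnt = 2 THEN 'Only'
            WHEN seqnum = 1 THEN 'Introducer'
            WHEN seqnum = cnt - 1 THEN 'Closer'
            ELSE 'Influencer'
       END AS Path
FROM en
ORDER BY User_ID, Event_Time
"""


def run_reverse_grouping(rows=ROWS):
    """Load rows into an in-memory SQLite table; return {Row: (grp, Path)}."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            'CREATE TABLE events ("Row" INTEGER, User_ID INTEGER, '
            "Event_Time TEXT, Event_Type TEXT)"
        )
        conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
        return {row: (grp, path) for row, grp, path in conn.execute(QUERY)}
    finally:
        conn.close()


if __name__ == "__main__":
    for row, (grp, path) in run_reverse_grouping().items():
        print(row, grp, path)
```

For user 2, rows 13 and 14 come out with grp = 0 because no purchase follows them, while rows 4 to 7 share grp = 3: the purchase in row 7 and the two later purchases are all counted when summing from the end.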

Which DBMS are you using? Please tag your question appropriately.
Sorry, my mistake! I'm running this on Google BigQuery.
Thanks Gordon! This is much easier to understand. I will test it and see whether it works!
It's obvious once you see it written down, but I never thought of ordering backwards. It certainly makes the problem much easier.
@Nick. I went back and forth on whether to answer, because yours is reasonable. In the end, the subquery prompted me to answer.
@GordonLinoff No one should have to suffer with a "reasonable" answer! :-) I'm sure that losing the subquery will improve performance considerably.