How to optimize joins to medium-sized Type II tables on Snowflake?

Background

Suppose I have the following tables:

-- 33M rows
CREATE TABLE lkp.session (
    session_id BIGINT,
    visitor_id BIGINT,
    session_datetime TIMESTAMP
);

-- 17M rows
CREATE TABLE lkp.visitor_customer_hist (
    visitor_id BIGINT,
    customer_id BIGINT,
    from_datetime TIMESTAMP,
    to_datetime TIMESTAMP
);
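
visitor_customer_hist gives the customer_id that is in effect for each visitor at every point in time.

The goal is to use visitor_id and session_datetime to look up the customer_id in effect for each session.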

CREATE TABLE lkp.session_effective_customer AS
    SELECT
        s.session_id,
        vch.customer_id AS effective_customer_id
    FROM lkp.session s
    JOIN lkp.visitor_customer_hist vch ON vch.visitor_id = s.visitor_id
        AND s.session_datetime >= vch.from_datetime
        AND s.session_datetime < vch.to_datetime;
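
Problem

This query is extremely slow, even on a Large warehouse. It took 1h15m to complete, and it was the only query running on the warehouse.

I verified that visitor_customer_hist contains no overlapping ranges, which would otherwise cause the join to produce duplicate rows.

Is Snowflake really bad at this kind of join? I'm looking for advice on how to optimize the tables for this type of query (reclustering or any other optimization technique) or on how to rework the query (for example, as a correlated subquery or something else).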

Additional information

Profile: [image not available]

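If the lkp.session table covers a narrow time range and the lkp.visitor_customer_hist table covers a wide one, you can rewrite the query with redundant conditions that restrict the range of rows considered in the join: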

CREATE TABLE lkp.session_effective_customer AS
SELECT
    s.session_id,
    vch.customer_id AS effective_customer_id
FROM lkp.session s
JOIN lkp.visitor_customer_hist vch ON vch.visitor_id = s.visitor_id
    AND s.session_datetime >= vch.from_datetime
    AND s.session_datetime < vch.to_datetime
WHERE vch.to_datetime >= (select min(session_datetime) from lkp.session)
    AND  vch.from_datetime <= (select max(session_datetime) from lkp.session);
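
On the other hand, if the two tables cover similar date ranges and each visitor is associated with many customers over time, this will not help much.

Building on that idea, we can filter the history table using the minimum and maximum session time of each visitor, like so: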

CREATE TEMPORARY TABLE _vch AS
    SELECT
        l.visitor_id,
        l.customer_id,
        l.from_datetime,
        l.to_datetime
    FROM (
             SELECT
                 l.visitor_id,
                 min(l.session_datetime) AS mindt,
                 max(l.session_datetime) AS maxdt
             FROM lkp.session l
             GROUP BY l.visitor_id
         ) a
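    JOIN lkp.visitor_customer_hist l ON a.visitor_id = l.visitor_id
        -- Keep every history row whose validity interval overlaps the visitor's
        -- session range; requiring full containment would drop boundary rows.
        AND l.from_datetime <= a.maxdt
        AND l.to_datetime > a.mindt;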

Then, with this lighter-weight hist table, we might have better luck:

CREATE TABLE lkp.session_effective_customer AS
    SELECT
        s.session_id,
        vch.customer_id AS effective_customer_id
    FROM lkp.session s
    JOIN _vch vch ON vch.visitor_id = s.visitor_id
        AND s.session_datetime >= vch.from_datetime
        AND s.session_datetime < vch.to_datetime;
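
Unfortunately, in my case, although this filtered out a large percentage of rows, the problem visitors (those with thousands of records in visitor_customer_hist) remained a problem: they still had thousands of records, so the join still exploded.

In other circumstances, though, this could work.

When both tables have a high number of records per visitor, this join is problematic for the reasons described in the comments. In that situation it is best to avoid this kind of join if at all possible.

The way I ultimately solved this was to scrap the visitor_customer_hist table and write a custom window function/UDTF.

Initially I created the lkp.visitor_customer_hist table because it could be built with existing window functions, and because on a non-MPP SQL database appropriate indexes could be created to make the lookup perform well enough. It was created like this: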

CREATE TABLE lkp.visitor_customer_hist AS
    SELECT
        a.visitor_id AS visitor_id,
        a.customer_id AS customer_id,
        nvl(lag(a.session_datetime) OVER ( PARTITION BY a.visitor_id
            ORDER BY a.session_datetime ), '1900-01-01') AS from_datetime,
        CASE WHEN lead(a.session_datetime) OVER ( PARTITION BY a.visitor_id
            ORDER BY a.session_datetime ) IS NULL THEN '9999-12-31'
        ELSE a.session_datetime END AS to_datetime
    FROM (
             SELECT
                 s.session_id,
                 vs.visitor_id,
                 customer_id,
                 row_number() OVER ( PARTITION BY vs.visitor_id, s.session_datetime
                     ORDER BY s.session_id ) AS rn,
                 lead(s.customer_id) OVER ( PARTITION BY vs.visitor_id
                     ORDER BY s.session_datetime ) AS next_cust_id,
                 session_datetime
             FROM "session" s
             JOIN "visitor_session" vs ON vs.session_id = s.session_id
             WHERE s.customer_id <> -2
         ) a
    WHERE (a.next_cust_id <> a.customer_id
        OR a.next_cust_id IS NULL) AND a.rn = 1;

The UDTF can be applied like this:

SELECT
    iff(a.customer_id <> -1, a.customer_id, ec.effective_customer_id) AS customer_id,
    a.session_id
FROM "session" a
JOIN table(udtf_eff_customer(nvl2(a.visitor_id, a.customer_id, NULL) :: DOUBLE) OVER ( PARTITION BY a.visitor_id
    ORDER BY a.session_datetime DESC )) ec;
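
So this achieves the desired result: for each session, if the customer_id is not unknown, we go ahead and use it; otherwise, we use the next customer_id (if one exists) that can be associated with that visitor, ordered by session time.

This is a much better solution than creating the lookup table: it essentially takes a single pass over the data, requires far less code and complexity, and runs very fast.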
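
The body of udtf_eff_customer is not shown above. Purely as an illustration, a Snowflake JavaScript UDTF with that behavior might look roughly like the sketch below. Everything inside the function is an assumption rather than the author's code: the output column name is inferred from the `ec.effective_customer_id` reference in the query, and -1 is assumed to be the only "unknown" marker (other sentinels, such as the -2 filtered out earlier, would need their own handling).

```sql
-- Hypothetical sketch, not the original function.
CREATE OR REPLACE FUNCTION udtf_eff_customer(CUSTOMER_ID DOUBLE)
    RETURNS TABLE (EFFECTIVE_CUSTOMER_ID DOUBLE)
    LANGUAGE JAVASCRIPT
    AS $$
    {
        // Runs once per partition (here: once per visitor_id).
        initialize: function (argumentInfo, context) {
            this.lastKnown = null;
        },
        // Rows arrive in the ORDER BY of the OVER clause (latest session first),
        // so lastKnown always holds the nearest later known customer.
        processRow: function (row, rowWriter, context) {
            var id = row.CUSTOMER_ID;  // SQL NULL arrives as undefined
            if (id !== undefined && id !== null && id !== -1) {
                this.lastKnown = id;
            }
            // Emit one output row per input row; null becomes SQL NULL.
            rowWriter.writeRow({EFFECTIVE_CUSTOMER_ID: this.lastKnown});
        }
    }
    $$;
```

Because the state lives in the per-partition object and each input row produces exactly one output row, the function needs only one ordered pass per visitor, which is consistent with the single-pass behavior described above.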

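That seems very long. I wonder how many rows each visitor has in visitor_customer_hist; there might be an explosion. As a clue, how long does the following take to run, and what is the result? SELECT count(*) FROM lkp.session s JOIN lkp.visitor_customer_hist vch ON vch.visitor_id = s.visitor_id

I hope you can see this! Re: join explosion, we can see from the insert that the join does not duplicate rows... but 98% of visitors have [...], I'll try excluding visitors with more than 1000 records and see whether it gets faster…

This is definitely unexpected. I would contact Snowflake support with the exact query ID; they may be able to help more. Also, what @chorbs said may help: even with no overlapping values, the join on visitor_id is fast because it is an equality, and then slow on the time range. For example, if both tables have 100,000 instances of the same visitor_id, that causes a severe explosion inside the join (up to 100,000 × 100,000 = 10 billion candidate row pairs for that one visitor before the range predicate is applied).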
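
To put numbers on the per-visitor row counts discussed in these comments, a diagnostic along the following lines could be run first. It is a minimal sketch that uses only the two tables from the question; the CTE names and the column aliases (session_rows, hist_rows, candidate_pairs) are made up for illustration.

```sql
-- For each visitor, estimate how many row pairs the equality part of the
-- join produces before the time-range predicate filters them.
WITH s AS (
    SELECT visitor_id, count(*) AS session_rows
    FROM lkp.session
    GROUP BY visitor_id
),
h AS (
    SELECT visitor_id, count(*) AS hist_rows
    FROM lkp.visitor_customer_hist
    GROUP BY visitor_id
)
SELECT
    s.visitor_id,
    s.session_rows,
    h.hist_rows,
    s.session_rows * h.hist_rows AS candidate_pairs
FROM s
JOIN h ON h.visitor_id = s.visitor_id
ORDER BY candidate_pairs DESC
LIMIT 20;
```

If a handful of visitors dominate candidate_pairs, they are the ones worth excluding or handling separately, as suggested in the comments.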