Snowflake cloud data platform: Snowflake window analytic function to set group sets


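I have the following dataset set up in the data lake. It acts as the source for a dimension, and I want to migrate the historical data into that dimension.

For example:

Please note that the datalake table has multiple columns that are not part of the dimension, so we are recalculating the checksum, which shows the same values but with different datefrom and dateto.

The data currently shows as
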

Primarykey       Checksum     DateFrom     Dateto 
   1              11           01:00         12/31/999 
   1              22           03:00         07:00
But the expected result is

Primarykey       Checksum     DateFrom     Dateto 
   1              11           01:00         03:00 
   1              22           03:00         07:00
   1              11           07:00         12/31/999 

I have tried several approaches within a single query. Are there any good suggestions?

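The reason you get only two rows is that your partition uses two columns, Primarykey and checksum, and those columns have only two distinct combinations. The extra row you want in the expected output has the same Primarykey and checksum (1, 11) as the first row of the expected output, so both fall into the same partition.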
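
To make the collapse concrete, here is a minimal sketch on hypothetical sample rows. The five rows, the ISO-style timestamp strings, and the `src` CTE are assumptions for illustration, since the original source data is not shown in the question. The `VALUES` syntax is the form Snowflake accepts.

```sql
-- Hypothetical sample rows; not the asker's real data.
-- Checksum 11 appears in two separate runs: before and after checksum 22.
WITH src AS (
    SELECT * FROM (VALUES
        (1, 11, '2020-01-01 01:00', '2020-01-01 02:00'),
        (1, 11, '2020-01-01 02:00', '2020-01-01 03:00'),
        (1, 22, '2020-01-01 03:00', '2020-01-01 05:00'),
        (1, 22, '2020-01-01 05:00', '2020-01-01 07:00'),
        (1, 11, '2020-01-01 07:00', '9999-12-31 00:00')
    ) AS v (primarykey, checksum, datefrom, dateto)
)
-- Grouping on (primarykey, checksum) alone cannot tell the two runs apart.
SELECT primarykey,
       checksum,
       MIN(datefrom) AS first_datefrom,
       MAX(dateto)   AS last_dateto
FROM src
GROUP BY primarykey, checksum
ORDER BY primarykey, checksum;
```

On these rows the query returns only two rows: checksum 11 is stretched from the first start time to the open-ended date, swallowing the interval that belongs to checksum 22. That is the same shape as the unwanted output in the question.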

Looking at your data, I see that if you include ActiveFlag in the partition, you will get the result:

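WITH base AS (
    SELECT
        primary_key,
        checksum,
        -- Earliest datefrom within each (primary_key, checksum, active_flag) partition
        FIRST_VALUE(datefrom) OVER (PARTITION BY primary_key, checksum, active_flag ORDER BY datefrom) AS datefrom,
        -- Explicit frame so LAST_VALUE sees the whole partition, not only rows up to the current one
        LAST_VALUE(dateto) OVER (PARTITION BY primary_key, checksum, active_flag ORDER BY datefrom
                                 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS dateto,
        ROW_NUMBER() OVER (PARTITION BY primary_key, checksum, active_flag ORDER BY datefrom) AS latest_record
    FROM Datalake.user
)
-- Keep one row per partition
SELECT * FROM base WHERE latest_record = 1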

Try this code. It should work in both Snowflake and Oracle. It creates a separate group each time the checksum changes in date order.

**SNOWFLAKE**:
WITH base AS (
    SELECT
        Primarykey,
        checksum,
        FIRST_VALUE(datefrom) OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom) AS Datefrom,
        -- Explicit frame so LAST_VALUE covers the whole partition
        LAST_VALUE(dateto) OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom
                                 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Dateto,
        ROW_NUMBER() OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom) AS latest_record
    FROM (
        SELECT
            Primarykey,
            checksum,
            checksum_prev,
            datefrom,
            dateto,
            -- Carry forward the marker of the most recent row where the checksum changed
            LAST_VALUE(CASE WHEN checksum <> checksum_prev THEN group1 END) IGNORE NULLS OVER (
                ORDER BY group1
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
            ) AS checksum_group
        FROM (
            SELECT
                Primarykey,
                checksum,
                datefrom,
                dateto,
                -- Note: these two windows are not partitioned by Primarykey
                LAG(checksum, 1, 0) OVER (ORDER BY datefrom) AS checksum_prev,
                -- Number rows in datefrom order so that group1 follows the timeline
                LPAD(1000 + ROW_NUMBER() OVER (ORDER BY datefrom), 4, 0) AS group1
            FROM Datalake.user
        )
    )
)
SELECT * FROM base WHERE latest_record = 1

**Oracle**:
WITH base AS (
    SELECT
        Primarykey,
        checksum,
        FIRST_VALUE(datefrom) OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom) AS Datefrom,
        -- Without an explicit frame, Oracle's default window ends at the current row
        LAST_VALUE(dateto) OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom
                                 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Dateto,
        ROW_NUMBER() OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom) AS latest_record
    FROM (
        SELECT
            Primarykey,
            checksum,
            checksum_prev,
            datefrom,
            dateto,
            -- Carry forward the marker of the most recent row where the checksum changed
            LAST_VALUE(CASE WHEN checksum <> checksum_prev THEN group1 END) IGNORE NULLS
                OVER (ORDER BY group1 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS checksum_group
        FROM (
            SELECT
                Primarykey,
                checksum,
                datefrom,
                dateto,
                -- Note: this window is not partitioned by Primarykey
                LAG(checksum, 1, 0) OVER (ORDER BY datefrom) AS checksum_prev,
                -- ROWNUM follows retrieval order, which is not guaranteed to match datefrom
                LPAD(1000 + ROWNUM, 4, 0) AS group1
            FROM Datalake.user
        )
    )
)
SELECT * FROM base WHERE latest_record = 1
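
For comparison, the same "new group whenever the checksum changes" idea can be sketched more compactly with a change flag and a running sum. This is an alternative illustration, not the query from the answer above. The sample rows, the `src`, `flagged` and `grouped` CTE names, and the `is_new_group` column are all hypothetical, and the `VALUES` syntax is the form Snowflake accepts (Oracle would need a different way to supply the sample rows).

```sql
-- Hypothetical sample rows; not the asker's real data.
WITH src AS (
    SELECT * FROM (VALUES
        (1, 11, '2020-01-01 01:00', '2020-01-01 02:00'),
        (1, 11, '2020-01-01 02:00', '2020-01-01 03:00'),
        (1, 22, '2020-01-01 03:00', '2020-01-01 05:00'),
        (1, 22, '2020-01-01 05:00', '2020-01-01 07:00'),
        (1, 11, '2020-01-01 07:00', '9999-12-31 00:00')
    ) AS v (primarykey, checksum, datefrom, dateto)
),
flagged AS (
    -- 1 when the checksum differs from the previous row of the same key
    -- (or when there is no previous row), otherwise 0.
    SELECT src.*,
           CASE
               WHEN checksum = LAG(checksum) OVER (PARTITION BY primarykey ORDER BY datefrom)
               THEN 0 ELSE 1
           END AS is_new_group
    FROM src
),
grouped AS (
    -- The running sum of the flags gives each consecutive run its own number.
    SELECT flagged.*,
           SUM(is_new_group) OVER (
               PARTITION BY primarykey
               ORDER BY datefrom
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS checksum_group
    FROM flagged
)
-- One output row per consecutive run of the same checksum.
SELECT primarykey,
       checksum,
       MIN(datefrom) AS first_datefrom,
       MAX(dateto)   AS last_dateto
FROM grouped
GROUP BY primarykey, checksum, checksum_group
ORDER BY primarykey, MIN(datefrom);
```

On these sample rows the result has three rows: checksum 11 from 01:00 to 03:00, checksum 22 from 03:00 to 07:00, and checksum 11 again from 07:00 to the open-ended date. Because every window is partitioned by `primarykey`, a change of key never carries a group over from the previous key.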

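I adjusted the query so that it works on the entire dataset. It was failing on the full data because the primary key was missing. The modified working query: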


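This will not work, because the active flag may be false while the next record is True and is inserted with a different primary key.

@snowflakeuser You cannot get the answer you need from limited information. I did not say it would work; although you did not mention it, the data shows that it would. Ultimately this is your problem to solve. We can help you understand how the tools work, or show ideas that appear to fit the task as you described it.

When I run it against the whole dataset, the results do not add up. This is caused by the computed group value.

Can you provide some examples of this case?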