Snowflake cloud data platform: window analytic functions for setting up grouping sets
I have the following dataset in a data lake. It is the source for a dimension, and I want to migrate the historical data into that dimension. Note that the data lake table has several columns that are not part of the dimension, so we recompute the checksum, which then shows the same values but with different DateFrom and DateTo. The data currently comes out as:
Primarykey  Checksum  DateFrom  Dateto
1           11        01:00     12/31/999
1           22        03:00     07:00
But the expected result is:
Primarykey  Checksum  DateFrom  Dateto
1           11        01:00     03:00
1           22        03:00     07:00
1           11        07:00     12/31/999
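I have tried several approaches within a single query. Are there any good suggestions?

You only get two rows because the partition uses two columns, Primarykey and checksum, and those columns have only two distinct combinations. The row you are missing from the output has the same Primarykey and checksum (1, 11) as the first row of the expected output.

Looking at your data, if you include ActiveFlag in the partition you will get the result: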
WITH base AS (
    SELECT
        primary_key,
        checksum,
        FIRST_VALUE(datefrom) OVER (PARTITION BY primary_key, checksum, active_flag ORDER BY datefrom) AS datefrom,
        LAST_VALUE(dateto) OVER (PARTITION BY primary_key, checksum, active_flag ORDER BY datefrom) AS dateto,
        ROW_NUMBER() OVER (PARTITION BY primary_key, checksum, active_flag ORDER BY datefrom) AS latest_record
    FROM Datalake.user
)
-- Keep one row per (primary_key, checksum, active_flag) combination
SELECT * FROM base WHERE latest_record = 1
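To make the comment about the two combinations concrete, here is a minimal sketch that reproduces the collapse. It uses Python's built-in sqlite3 as a stand-in for Snowflake, and the table name `history` and its three rows are hypothetical, chosen to match the expected output in the question. The frame on LAST_VALUE is spelled out because SQLite's default frame ends at the current row.

```python
import sqlite3

# Hypothetical history for one key: the checksum goes 11 -> 22 -> 11.
rows = [
    (1, 11, "01:00", "03:00"),
    (1, 22, "03:00", "07:00"),
    (1, 11, "07:00", "12/31/999"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history (primary_key INTEGER, checksum INTEGER, datefrom TEXT, dateto TEXT)"
)
conn.executemany("INSERT INTO history VALUES (?, ?, ?, ?)", rows)

# Partitioning only by (primary_key, checksum) puts both checksum-11 rows
# into the same partition, even though they are not adjacent in time.
collapsed = conn.execute(
    """
    WITH base AS (
        SELECT
            primary_key,
            checksum,
            FIRST_VALUE(datefrom) OVER (
                PARTITION BY primary_key, checksum ORDER BY datefrom
            ) AS first_datefrom,
            LAST_VALUE(dateto) OVER (
                PARTITION BY primary_key, checksum ORDER BY datefrom
                ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
            ) AS last_dateto,
            ROW_NUMBER() OVER (
                PARTITION BY primary_key, checksum ORDER BY datefrom
            ) AS latest_record
        FROM history
    )
    SELECT primary_key, checksum, first_datefrom, last_dateto
    FROM base
    WHERE latest_record = 1
    ORDER BY first_datefrom
    """
).fetchall()

for row in collapsed:
    print(row)
```

With these sample rows the query returns only two rows, and the checksum-11 row spans 01:00 to 12/31/999, which is the unwanted result described in the question.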
Try this code. It should work in both Snowflake and Oracle. It creates a separate group whenever the checksum changes in date order.
**SNOWFLAKE**:
WITH base AS (
    SELECT
        Primarykey,
        checksum,
        FIRST_VALUE(datefrom) OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom) AS Datefrom,
        LAST_VALUE(dateto) OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom) AS Dateto,
        ROW_NUMBER() OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom) AS latest_record
    FROM (
        SELECT
            Primarykey,
            checksum,
            checksum_prev,
            datefrom,
            dateto,
            -- Carry the label of the most recent checksum change forward as the group id
            LAST_VALUE(CASE WHEN checksum <> checksum_prev THEN group1 END) IGNORE NULLS OVER (
                ORDER BY group1
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
            ) AS checksum_group
        FROM (
            SELECT
                Primarykey,
                checksum,
                datefrom,
                dateto,
                -- Previous checksum in date order (0 for the first row); not partitioned by Primarykey
                LAG(checksum, 1, 0) OVER (ORDER BY datefrom) AS checksum_prev,
                -- Row label used to build the group id
                LPAD(1000 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 4, 0) AS group1
            FROM Datalake.user
        )
    )
)
SELECT * FROM base WHERE latest_record = 1
**Oracle**:
WITH base AS (
    SELECT
        Primarykey,
        checksum,
        FIRST_VALUE(datefrom) OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom) AS Datefrom,
        -- Oracle's default frame ends at the current row, so span the whole partition explicitly
        LAST_VALUE(dateto) OVER (
            PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS Dateto,
        ROW_NUMBER() OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom) AS latest_record
    FROM (
        SELECT
            Primarykey,
            checksum,
            checksum_prev,
            datefrom,
            dateto,
            -- Carry the label of the most recent checksum change forward as the group id
            LAST_VALUE(CASE WHEN checksum <> checksum_prev THEN group1 END) IGNORE NULLS OVER (
                ORDER BY group1
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
            ) AS checksum_group
        FROM (
            SELECT
                Primarykey,
                checksum,
                datefrom,
                dateto,
                -- Previous checksum in date order (0 for the first row); not partitioned by Primarykey
                LAG(checksum, 1, 0) OVER (ORDER BY datefrom) AS checksum_prev,
                -- Row label used to build the group id
                LPAD(1000 + ROWNUM, 4, 0) AS group1
            FROM Datalake.user
        )
    )
)
SELECT * FROM base WHERE latest_record = 1
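I adjusted the query so that it works on the whole dataset. It had been failing on the full data because the primary key was missing.

Comments:

This will not work, because the active flag may be false while the next record is true and is inserted with a different primary key.

@snowflakeuser You cannot get the answer you need from limited information. I did not say it would work; although you did not mention it, the data you posted suggests that it would. In the end this is your problem to solve. We can help you understand how these tools work, or show ideas that seem to fit the task as you describe it.

When I run it against the whole dataset, the results do not add up. This is caused by the way the group values are computed.

Can you provide some examples of this case?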
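The idea discussed in this thread, starting a new group whenever the checksum changes in date order, can also be sketched with LAG and a running SUM, with every window partitioned by the key so that several primary keys can be processed together. This is only an illustration under stated assumptions: it runs on Python's built-in sqlite3 rather than Snowflake, the table name `history` and its rows are hypothetical, and it is not the modified query the asker refers to.

```python
import sqlite3

# Hypothetical history for two keys; key 1 returns to checksum 11 later on.
# The time strings happen to sort correctly as text; real data would use
# DATE or TIMESTAMP columns.
rows = [
    (1, 11, "01:00", "02:00"),
    (1, 11, "02:00", "03:00"),
    (1, 22, "03:00", "07:00"),
    (1, 11, "07:00", "12/31/999"),
    (2, 33, "01:00", "05:00"),
    (2, 33, "05:00", "12/31/999"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history (primary_key INTEGER, checksum INTEGER, datefrom TEXT, dateto TEXT)"
)
conn.executemany("INSERT INTO history VALUES (?, ?, ?, ?)", rows)

QUERY = """
WITH flagged AS (
    -- 1 when the checksum differs from the previous row of the same key
    -- (or there is no previous row), otherwise 0.
    SELECT
        primary_key, checksum, datefrom, dateto,
        CASE
            WHEN checksum = LAG(checksum) OVER (
                PARTITION BY primary_key ORDER BY datefrom
            ) THEN 0
            ELSE 1
        END AS is_new_group
    FROM history
),
grouped AS (
    -- The running sum of the flags numbers each consecutive run per key.
    SELECT
        primary_key, checksum, datefrom, dateto,
        SUM(is_new_group) OVER (
            PARTITION BY primary_key ORDER BY datefrom
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS checksum_group
    FROM flagged
)
SELECT DISTINCT
    primary_key,
    checksum,
    FIRST_VALUE(datefrom) OVER (
        PARTITION BY primary_key, checksum_group ORDER BY datefrom
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS first_datefrom,
    LAST_VALUE(dateto) OVER (
        PARTITION BY primary_key, checksum_group ORDER BY datefrom
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS last_dateto
FROM grouped
ORDER BY primary_key, first_datefrom
"""

result = conn.execute(QUERY).fetchall()
for row in result:
    print(row)
```

For key 1 this yields three rows (11 from 01:00 to 03:00, 22 from 03:00 to 07:00, 11 from 07:00 to 12/31/999), matching the expected output in the question, and key 2 collapses to a single row. Because the group number comes from a per-key running sum rather than a padded row label, it does not depend on the total number of rows in the table.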