Filtering on ranges in nested arrays in BigQuery and deduplicating the results
I'm playing with BigQuery and nested tables, and SQL is not my strong suit. I'm trying to solve a real problem with real production data while also trying to get some SQL/BQ concepts into my head.
My query is similar to some of what is on the page, but the similarity is not enough for me.
Let me give some sample data that closely resembles my real data structure, and then describe what I need to get out of it.
Basically, I have two tables, and I want to use one to filter the other.
Table 1 has two levels of nesting and can be built like this:
WITH data AS (
SELECT "Test 1" AS name, [STRUCT(1 AS id, [20, 21] AS results), STRUCT(2 AS id, [22, 23] AS results)] AS resultset
UNION ALL
SELECT "Test 2" AS name, [STRUCT(1 AS id, [23, 24] AS results), STRUCT(2 AS id, [25, 26] AS results)] AS resultset
UNION ALL
SELECT "Test 3" AS name, [STRUCT(1 AS id, [26, 27] AS results), STRUCT(2 AS id, [28, 29] AS results)] AS resultset
)
SELECT * FROM data
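The meaning of these numbers doesn't matter. What matters is that table 2 contains the ranges I want to use to filter table 1. Table 2 can be built like this: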
WITH ranges AS (
SELECT "Range 1" AS title, 24.0 AS min, 25.0 AS max
UNION ALL
SELECT "Range 2" AS title, 26.0 AS min, 27.0 AS max
)
SELECT * from ranges
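What I want to end up with is the rows from the first table where any result matches one or more of the ranges in the second table, and none of the rows that don't match.
I know I can unnest and join the two tables to get a filtered result, but because of the unnesting, the result contains duplicates: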
WITH data AS (
SELECT "Test 1" as name, [STRUCT(1 as id, [20, 21] as results), STRUCT(2 as id, [22, 23] as results)] as resultset
UNION ALL
SELECT "Test 2" as name, [STRUCT(1 as id, [23, 24] as results), STRUCT(2 as id, [25, 26] as results)] as resultset
UNION ALL
SELECT "Test 3" as name, [STRUCT(1 as id, [26, 27] as results), STRUCT(2 as id, [28, 29] as results)] as resultset
),
ranges AS (
SELECT "Range 1" AS title, 24.0 as min, 25.0 as max
UNION ALL
SELECT "Range 2" AS title, 26.0 as min, 27.0 as max
)
SELECT data.*
FROM data, UNNEST(resultset), UNNEST(results) r
JOIN ranges
ON r BETWEEN min AND max
This is what I get:
Row  name    resultset.id  resultset.results
1    Test 2  1             23
                           24
             2             25
                           26
2    Test 2  1             23
                           24
             2             25
                           26
3    Test 2  1             23
                           24
             2             25
                           26
4    Test 3  1             26
                           27
             2             28
                           29
5    Test 3  1             26
                           27
             2             28
                           29
What I'd like to do is call DISTINCT data.* in the SELECT to reduce this to the two unique rows, and then work with those.
In other words, this is what I want:
Row  name    resultset.id  resultset.results
1    Test 2  1             23
                           24
             2             25
                           26
2    Test 3  1             26
                           27
             2             28
                           29
But I can't do that with nested data.
So I have two questions:
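About the data: I can't change the first table. The second table I can rework, if that leads to a simple solution.

Try selecting only the data you need from the dataset. This query returns unique but unnested results: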
SELECT data.name, rs.id, r
FROM data
left join UNNEST(resultset) rs
left join UNNEST(results) as r
JOIN ranges ON r BETWEEN min AND max
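Below is for BigQuery Standard SQL.
The simplest solution, which doesn't really change the core of the query you already have, is to add a GROUP BY, as follows: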
#standardSQL
SELECT ANY_VALUE(data).*
FROM data, UNNEST(resultset), UNNEST(results) r
JOIN ranges ON r BETWEEN min AND max
GROUP BY TO_JSON_STRING(data)
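This works! But I don't understand why. Can you elaborate?
Sure.
SELECT DISTINCT ... FROM ...
is conceptually equivalent to SELECT ... GROUP BY ...
So the task is to find appropriate values for the GROUP BY and a corresponding aggregate function (which GROUP BY requires).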
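As a small illustration of that equivalence, here is a sketch on a flat table. The CTE names t, via_distinct, and via_group_by are hypothetical and not part of the original question. Both deduplication forms should yield the same number of rows:

```sql
#standardSQL
-- Hypothetical flat table with one duplicated value, for illustration only.
WITH t AS (
  SELECT "Test 2" AS name
  UNION ALL
  SELECT "Test 2" AS name
  UNION ALL
  SELECT "Test 3" AS name
),
-- Deduplicate with DISTINCT.
via_distinct AS (
  SELECT DISTINCT name FROM t
),
-- Deduplicate by grouping on the same column.
via_group_by AS (
  SELECT name FROM t GROUP BY name
)
-- Compare the row counts of the two forms.
SELECT
  (SELECT COUNT(*) FROM via_distinct) AS distinct_rows,
  (SELECT COUNT(*) FROM via_group_by) AS group_by_rows
```

For the nested case the same idea applies, except that the grouping key has to be something GROUP BY can work with, which is where a string form of the row comes in.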
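ANY_VALUE and TO_JSON_STRING(data) are what we need here: TO_JSON_STRING(data) serializes each whole row, nested arrays included, into a string that GROUP BY can group on, and ANY_VALUE(data) picks one representative row from each group of identical rows.

This might be an eventual approach, but it fails the original condition: if any sub-sub-value matches any range in table 2, I want the entire row output: "What I want to end up with is the rows from the first table where any result matches one or more of the ranges in the second table."
You should add the expected result to the question. It makes the work easier.
Also, before I try to apply this to my real workflow: does this approach scale? I typically have hundreds of millions of rows.
Awesome; thanks. So, stepping back from this problem: am I solving it the "right" way, or are there other/better ways to achieve the same result?
Your approach is good enough, so I intentionally didn't change it and focused on your main issue, the duplicate output. But there is almost always a better way o)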
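On the "better way" point, one possible alternative is to avoid producing the duplicates in the first place by testing each row with EXISTS instead of joining on the unnested values. The following is only a sketch under assumptions: the range_list CTE and the lo, hi, and bounds names are introduced here for illustration and do not appear in the thread, and whether this performs better than the GROUP BY version on hundreds of millions of rows is not established here and would need to be measured.

```sql
#standardSQL
WITH data AS (
  SELECT "Test 1" AS name, [STRUCT(1 AS id, [20, 21] AS results), STRUCT(2 AS id, [22, 23] AS results)] AS resultset
  UNION ALL
  SELECT "Test 2" AS name, [STRUCT(1 AS id, [23, 24] AS results), STRUCT(2 AS id, [25, 26] AS results)] AS resultset
  UNION ALL
  SELECT "Test 3" AS name, [STRUCT(1 AS id, [26, 27] AS results), STRUCT(2 AS id, [28, 29] AS results)] AS resultset
),
ranges AS (
  SELECT "Range 1" AS title, 24.0 AS min, 25.0 AS max
  UNION ALL
  SELECT "Range 2" AS title, 26.0 AS min, 27.0 AS max
),
-- Hypothetical helper: collapse the ranges table into a single row
-- that holds an array of (lo, hi) pairs.
range_list AS (
  SELECT ARRAY_AGG(STRUCT(ranges.min AS lo, ranges.max AS hi)) AS bounds
  FROM ranges
)
-- Keep a row of data if at least one nested result falls inside
-- at least one range. Each row is emitted at most once, so no
-- deduplication step is needed afterwards.
SELECT d.*
FROM data AS d
CROSS JOIN range_list AS rl
WHERE EXISTS (
  SELECT 1
  FROM UNNEST(d.resultset) AS rs,
       UNNEST(rs.results) AS r,
       UNNEST(rl.bounds) AS b
  WHERE r BETWEEN b.lo AND b.hi
)
```

With the sample data from the question, this should return the "Test 2" and "Test 3" rows once each, with their nested structure intact.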