Google BigQuery: Filter ranges in nested arrays in BigQuery and de-duplicate the results


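I'm playing with BigQuery and nested tables, and SQL isn't my strong suit. I'm trying to solve a real problem with real production data while also getting some SQL/BQ concepts into my head.

My query is similar to some of the examples on the page, but not similar enough for me.

Let me give some sample data that closely resembles my real data structure, and then describe what I need to get out of it.

Basically, I have two tables, and I want to use one to filter the other.

Table 1 has two levels of nesting and can be built like this: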

WITH data AS (
    SELECT "Test 1" AS name, [STRUCT(1 AS id, [20, 21] AS results), STRUCT(2 AS id, [22, 23] AS results)] AS resultset
    UNION ALL
    SELECT "Test 2" AS name, [STRUCT(1 AS id, [23, 24] AS results), STRUCT(2 AS id, [25, 26] AS results)] AS resultset
    UNION ALL
    SELECT "Test 3" AS name, [STRUCT(1 AS id, [26, 27] AS results), STRUCT(2 AS id, [28, 29] AS results)] AS resultset
)
SELECT * FROM data

The meaning of these numbers doesn't matter. What matters is that table 2 contains the ranges I want to use to filter table 1. Table 2 can be built like this:

WITH ranges AS (
    SELECT "Range 1" AS title, 24.0 AS min, 25.0 AS max
    UNION ALL
    SELECT "Range 2" AS title, 26.0 AS min, 27.0 AS max
)
SELECT * FROM ranges

What I want to end up with is the rows from the first table where any result matches one or more of the ranges in the second table, and none of the rows that don't match.

I know I can UNNEST the arrays and JOIN the two tables to get a filtered result, but because of the unnesting, the result will contain duplicates:

WITH data AS (
  SELECT "Test 1" as name, [STRUCT(1 as id, [20, 21] as results), STRUCT(2 as id, [22, 23] as results)] as resultset
  UNION ALL
  SELECT "Test 2" as name, [STRUCT(1 as id, [23, 24] as results), STRUCT(2 as id, [25, 26] as results)] as resultset
  UNION ALL
  SELECT "Test 3" as name, [STRUCT(1 as id, [26, 27] as results), STRUCT(2 as id, [28, 29] as results)] as resultset
),
ranges AS (
  SELECT "Range 1" AS title, 24.0 as min, 25.0 as max
  UNION ALL
  SELECT "Range 2" AS title, 26.0 as min, 27.0 as max
)
SELECT data.*
FROM data, UNNEST(resultset), UNNEST(results) r
JOIN ranges
ON r BETWEEN min AND max

This is what I get:

Row     name    resultset.id    resultset.results

1       Test 2             1                   23
                                               24
                           2                   25
                                               26

2       Test 2             1                   23
                                               24
                           2                   25
                                               26

3       Test 2             1                   23
                                               24
                           2                   25
                                               26

4       Test 3             1                   26
                                               27
                           2                   28
                                               29

5       Test 3             1                   26
                                               27
                           2                   28
                                               29
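
To see where the five rows come from, here is a minimal sketch that counts the matching (value, range) pairs per row instead of returning the rows themselves. The aliases d, rs, r, g and the column name matching_pairs are introduced only for this illustration; the sample tables are the ones defined above.

```sql
-- Sample tables copied from the question.
WITH data AS (
  SELECT "Test 1" AS name, [STRUCT(1 AS id, [20, 21] AS results), STRUCT(2 AS id, [22, 23] AS results)] AS resultset
  UNION ALL
  SELECT "Test 2" AS name, [STRUCT(1 AS id, [23, 24] AS results), STRUCT(2 AS id, [25, 26] AS results)] AS resultset
  UNION ALL
  SELECT "Test 3" AS name, [STRUCT(1 AS id, [26, 27] AS results), STRUCT(2 AS id, [28, 29] AS results)] AS resultset
),
ranges AS (
  SELECT "Range 1" AS title, 24.0 AS min, 25.0 AS max
  UNION ALL
  SELECT "Range 2" AS title, 26.0 AS min, 27.0 AS max
)
-- Each (value, range) match yields one joined row, so a row of `data`
-- is repeated once per matching pair.
SELECT d.name, COUNT(*) AS matching_pairs
FROM data AS d
CROSS JOIN UNNEST(d.resultset) AS rs
CROSS JOIN UNNEST(rs.results) AS r
JOIN ranges AS g
  ON r BETWEEN g.min AND g.max
GROUP BY d.name
```

With the sample data, "Test 2" has three matching pairs (24 and 25 fall in Range 1, 26 falls in Range 2) and "Test 3" has two (26 and 27 fall in Range 2), which accounts for the three plus two repeated rows shown above.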

What I'd like to do is use DISTINCT data.* in the SELECT to reduce this to the two unique rows, and then work with those.

In other words, this is what I want:

Row     name    resultset.id    resultset.results

1       Test 2             1                   23
                                               24
                           2                   25
                                               26

2       Test 3             1                   26
                                               27
                           2                   28
                                               29
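
But I can't do that with nested data.

So I have two questions:

  • How can I collapse identical rows in this situation?
  • Am I on the wrong track, and is there a better way to achieve this?

  • About the data: I can't change the first table. The second table I can change, if that leads to a simpler solution.

    Try selecting only the data you need from the dataset. This query returns unique but unnested (flattened) results: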

     SELECT data.name, rs.id, r
     FROM data
     left join UNNEST(resultset) rs
     left join UNNEST(results) as r
     JOIN ranges ON r BETWEEN min AND max
    

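    Below is for BigQuery Standard SQL.

    The simplest solution (one that does not change the core of the query you already have) is to add a GROUP BY, as follows: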

    #standardSQL
    SELECT ANY_VALUE(data).*
    FROM data, UNNEST(resultset), UNNEST(results) r
    JOIN ranges ON r BETWEEN min AND max
    GROUP BY TO_JSON_STRING(data)    
    
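    "This works! But I don't understand why. Can you elaborate?"

    Sure.

    SELECT DISTINCT ... FROM ...
    is conceptually equivalent to
    SELECT ... GROUP BY ...

    So the task is to find an appropriate value to GROUP BY, plus a matching aggregate function (which GROUP BY requires for every selected expression that is not part of the grouping key).

    ANY_VALUE
    and
    TO_JSON_STRING(data)
    are what we need here: TO_JSON_STRING(data) turns each whole row, nested arrays included, into a single string that can be grouped on, and ANY_VALUE(data) picks one of the identical rows from each group.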
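
The equivalence can be sketched on a small flat table, where plain DISTINCT also works. The table t and its columns k and v are made up for this illustration and are not part of the question's data.

```sql
-- Statement 1: de-duplicate with DISTINCT.
WITH t AS (
  SELECT 'a' AS k, 1 AS v UNION ALL
  SELECT 'a' AS k, 1 AS v UNION ALL
  SELECT 'b' AS k, 2 AS v
)
SELECT DISTINCT k, v
FROM t;

-- Statement 2: the same de-duplication expressed as GROUP BY.
-- TO_JSON_STRING(t) serializes the whole row into one string used as the
-- grouping key; ANY_VALUE(t) returns one row from each group, and .*
-- expands it back into columns.
WITH t AS (
  SELECT 'a' AS k, 1 AS v UNION ALL
  SELECT 'a' AS k, 1 AS v UNION ALL
  SELECT 'b' AS k, 2 AS v
)
SELECT ANY_VALUE(t).*
FROM t
GROUP BY TO_JSON_STRING(t);
```

Both statements collapse the duplicated ('a', 1) row, leaving one row for 'a' and one for 'b'. The second form is the one that carries over to rows containing nested arrays, because the grouping key is a string rather than the array column itself.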

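    This might be a way to get there eventually, but it fails the original condition: if any sub-sub-value matches any range in table 2, I want the whole row in the output: "What I want to end up with is the rows from the first table where any result matches one or more of the ranges in the second table."

    You should add the expected result to the question. It makes it easier to work with.

    Oh, wow; this works! But I don't understand why. Can you elaborate? Also, before I try applying this to my actual workflow, does this approach scale? I'll typically have hundreds of millions of rows.

    Sure, see the answer.

    Awesome; thanks. So, taking a step back from this problem: did I solve it "correctly", or are there other/better ways to achieve the same result?

    Your approach is good enough, so I deliberately didn't change it and focused on your main issue, the duplicated output. But there is almost always a better way o)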
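
As one possible illustration of "a better way" (this is not something proposed in the thread), the duplicates can be avoided in the first place by testing each row with EXISTS instead of joining and de-duplicating afterwards. The sketch below assumes it is acceptable to pack table 2 into a single array first; the names packed, bounds, lo, hi, d, p, rs, r and b are hypothetical and introduced only for this example.

```sql
-- Sample tables copied from the question.
WITH data AS (
  SELECT "Test 1" AS name, [STRUCT(1 AS id, [20, 21] AS results), STRUCT(2 AS id, [22, 23] AS results)] AS resultset
  UNION ALL
  SELECT "Test 2" AS name, [STRUCT(1 AS id, [23, 24] AS results), STRUCT(2 AS id, [25, 26] AS results)] AS resultset
  UNION ALL
  SELECT "Test 3" AS name, [STRUCT(1 AS id, [26, 27] AS results), STRUCT(2 AS id, [28, 29] AS results)] AS resultset
),
ranges AS (
  SELECT "Range 1" AS title, 24.0 AS min, 25.0 AS max
  UNION ALL
  SELECT "Range 2" AS title, 26.0 AS min, 27.0 AS max
),
-- Hypothetical helper: all ranges packed into one single-row array.
packed AS (
  SELECT ARRAY_AGG(STRUCT(ranges.min AS lo, ranges.max AS hi)) AS bounds
  FROM ranges
)
SELECT d.*
FROM data AS d
CROSS JOIN packed AS p  -- packed has exactly one row, so no row of data is multiplied
WHERE EXISTS (
  -- True as soon as any nested value falls inside any range.
  SELECT 1
  FROM UNNEST(d.resultset) AS rs,
       UNNEST(rs.results) AS r,
       UNNEST(p.bounds) AS b
  WHERE r BETWEEN b.lo AND b.hi
);
```

With the sample data this returns the complete "Test 2" and "Test 3" rows once each, with their nested structure intact, and omits "Test 1". Whether it performs better than the GROUP BY variant at hundreds of millions of rows is not something the thread establishes, so it would need to be measured on real data.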