Google BigQuery: filtering on a clustered field using a subselect query


Using Google BigQuery, I am querying a clustered table by applying a filter on the clustering field projectId, like this:

WITH userProjects AS (

    SELECT 
        projectsArray 
    FROM 
        projectsPerUser 
    WHERE 
        userId = "eben@somewhere.com"
)

SELECT 
    userProperty
FROM 
    `mydata.mydataset.mytable`
WHERE 
    --projectId IN UNNEST((SELECT projectsArray FROM userProjects))
    projectId IN ("mydata", "anotherproject")
    AND _PARTITIONTIME >= "2019-03-20"
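Clustering is applied correctly in the snippet above, but when I use the commented-out line instead, projectId IN UNNEST((SELECT projectsArray FROM userProjects)), clustering is not applied.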

I also tried wrapping it in a UDF like this, but that does not work either:

CREATE TEMP FUNCTION storedValue(item ARRAY<STRING>) AS (
  item
);

...

WHERE projectId IN UNNEST(storedValue((SELECT projectsListArray FROM projectsList)))
As I understand it, a subselect query follows a different execution path from filtering directly on a scalar or an array.

I am hoping for a solution where I can supply the filter array programmatically and still get the cost benefits that clustered tables provide.

In summary:

WHERE projectId IN ("mydata", "anotherproject") [OK]
WHERE projectId IN UNNEST((SELECT projectsArray FROM userProjects)) [NOT OK]
WHERE projectId IN UNNEST(storedValue((SELECT projectsListArray FROM projectsList))) [NOT OK]

Any ideas?

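My suggestion is to rewrite your query so that the nested SELECT becomes a temporary table (which you have already done with the WITH clause), and then use an inner join instead of a set-membership test to do the filtering. Your query would then look something like this:

WITH userProjects AS (

    -- Flatten the array so the join compares STRING to STRING
    -- (assumes projectsArray is an ARRAY<STRING>).
    SELECT 
        project 
    FROM 
        projectsPerUser,
        UNNEST(projectsArray) AS project
    WHERE 
        userId = "eben@somewhere.com"
)

SELECT 
    userProperty
FROM 
    `mydata.mydataset.mytable` AS a
    JOIN
    userProjects AS b
    ON a.projectId = b.project
WHERE 
    a._PARTITIONTIME >= "2019-03-20"

I believe this will result in a query that does not scan the whole partition if the field is clustered.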

FWIW, clustering works well for me with a dynamic filter. With a literal list first:

SELECT title, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(TIMESTAMP_TRUNC(datehour, DAY)) = '2019-01-01'
AND wiki='en'
AND title IN ('Dogfight_(disambiguation)','Dogfight','Dogfight_(film)')
GROUP BY 1

1.8 sec elapsed, 364 MB processed
If I replace the literal list with a subquery:

AND title IN (
  SELECT DISTINCT prev 
  FROM `fh-bigquery.wikipedia_vt.clickstream_materialized` 
  WHERE date='2019-01-01' AND prev LIKE 'Dogfight%'
  ORDER BY 1  LIMIT 3)

2.9 sec elapsed, 513.8 MB processed
If I switch to v2 (not clustered) instead of v3:

FROM `fh-bigquery.wikipedia_v2.pageviews_2019`

2.6 sec elapsed, 9.6 GB processed

I'm not sure what is happening with your table, but it might be worth taking another look.
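
For supplying the array programmatically, one more option worth trying is BigQuery scripting: compute the array in a separate statement, store it in a variable, and filter on the variable. This is only a sketch under assumptions. It reuses the table and column names from the question, the variable name user_projects is made up for illustration, and whether block pruning actually kicks in should be checked against the bytes billed for your own table.

```sql
-- Hypothetical variable holding the project list for one user.
DECLARE user_projects ARRAY<STRING>;

-- Resolve the array first, in its own statement.
-- Assumes exactly one row per userId; more than one row raises an error.
-- projectsPerUser may need a dataset qualifier in a real project.
SET user_projects = (
  SELECT projectsArray
  FROM projectsPerUser
  WHERE userId = "eben@somewhere.com"
);

-- The filter now sees an already-computed array value rather than a subquery.
SELECT userProperty
FROM `mydata.mydataset.mytable`
WHERE projectId IN UNNEST(user_projects)
  AND _PARTITIONTIME >= "2019-03-20";
```

The idea is that the main query no longer contains a subselect in the clustering filter, so it looks closer to the literal-list case that was reported as working.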

This is awesome, Daniel! Thanks for the answer.