Why did my Google BigQuery query take so long?
Running the following query against our test dataset took 18 minutes:
SELECT count(distinct S1.visitorId, 50000) as returningVisitors,
STRFTIME_UTC_USEC(UTC_USEC_TO_DAY(PARSE_UTC_USEC(S1.timeStamp)), '%Y-%m-%d') AS day,
S1.dimension1, S1.dimension2
FROM [myDataset.MyTable] as S1
JOIN EACH [myDataset.MyTable] as S2 on S1.visitorId= S2.visitorId
WHERE UTC_USEC_TO_DAY(PARSE_UTC_USEC(S1.timeStamp)) < UTC_USEC_TO_DAY(NOW()) and
S2.timeStamp < STRFTIME_UTC_USEC(UTC_USEC_TO_DAY(PARSE_UTC_USEC(S1.timeStamp)), '%Y-%m-%d')
GROUP EACH BY S1.dimension1, S1.dimension2, day
ORDER BY S1.dimension1, S1.dimension2, day;
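In the end, I got the following message in the web browser:
Query complete (1112.1s elapsed, 1.62 MB processed)
I would like to know why it took so long. I usually get much faster results from BigQuery.
The query joins the table with itself to get the number of returning visitors per day and per dimension. I expected the query to take maybe 5-6 minutes, but not 18, especially since the table is not that big.
My table has about 31,000 rows and a total size of 4.25 MB.
My job ID is: job_b657aceeb1004994b0b03324461cdcd2

Does this query still take that long to process? If it only happened once, the "why" may be a rare internal performance issue.

Tell me if I have this right: the only reason you self-join the table is to check whether the user has visited before? In that case, you are generating an exponentially growing number of combinations (am I using the right word?) without needing to. The query only references S2 once, to check whether its timestamp is earlier than the day of the current row's timestamp.
What if you replace: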
JOIN EACH [myDataset.MyTable] as S2 on S1.visitorId= S2.visitorId
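with:
JOIN EACH
(SELECT visitorId, MIN(timeStamp) timeStamp FROM [myDataset.MyTable] GROUP EACH BY 1) S2
ON S1.visitorId= S2.visitorId
to get: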
SELECT count(distinct S1.visitorId, 50000) as returningVisitors,
STRFTIME_UTC_USEC(UTC_USEC_TO_DAY(PARSE_UTC_USEC(S1.timeStamp)), '%Y-%m-%d') AS day,
S1.dimension1, S1.dimension2
FROM [myDataset.MyTable] as S1
JOIN EACH
(SELECT visitorId, MIN(timeStamp) timeStamp FROM [myDataset.MyTable] GROUP EACH BY 1) S2
ON S1.visitorId= S2.visitorId
WHERE UTC_USEC_TO_DAY(PARSE_UTC_USEC(S1.timeStamp)) < UTC_USEC_TO_DAY(NOW()) and
S2.timeStamp < STRFTIME_UTC_USEC(UTC_USEC_TO_DAY(PARSE_UTC_USEC(S1.timeStamp)), '%Y-%m-%d')
GROUP EACH BY S1.dimension1, S1.dimension2, day
ORDER BY S1.dimension1, S1.dimension2, day;
?
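To illustrate why the self-join is costly, here is a small Python sketch (not BigQuery code) that counts the rows each join would produce. The visitor IDs and visit counts are made up for illustration. The point is only that joining the table to itself on visitorId pairs every row of a visitor with every row of that same visitor, while joining against a single MIN(timeStamp) row per visitor produces one joined row per table row.

```python
from collections import Counter

def self_join_rows(visitor_ids):
    """Rows produced by joining the table to itself on visitorId."""
    counts = Counter(visitor_ids)
    # A visitor with n rows contributes n * n joined rows.
    return sum(n * n for n in counts.values())

def min_join_rows(visitor_ids):
    """Rows produced by joining against one row per visitor."""
    # Every table row matches exactly one row of the per-visitor subquery.
    return len(visitor_ids)

# Hypothetical data: three visitors with 1000, 100 and 10 rows.
visits = ["a"] * 1000 + ["b"] * 100 + ["c"] * 10
print(self_join_rows(visits))  # 1010100
print(min_join_rows(visits))   # 1110
```

Strictly speaking, the growth is quadratic in the number of rows per visitor rather than exponential, but a few very active visitors are enough to make the intermediate result far larger than the table itself.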
Some notes:
- Try replacing NOW() with a concrete datetime, so that the query results can be cached.
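One way to apply this note is to compute the day boundary on the client and paste it into the query text as a literal. The Python helper below is a hypothetical sketch, not part of any BigQuery client library. It assumes that PARSE_UTC_USEC accepts a 'YYYY-MM-DD HH:MM:SS' string, and that the resulting expression is substituted for UTC_USEC_TO_DAY(NOW()) in the WHERE clause.

```python
from datetime import datetime, timezone

def fixed_day_boundary(now=None):
    """Return a legacy SQL expression for the start of the current UTC day."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Truncate to midnight UTC, mirroring what UTC_USEC_TO_DAY(NOW()) computes.
    midnight = now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return "PARSE_UTC_USEC('%s')" % midnight.strftime("%Y-%m-%d %H:%M:%S")

print(fixed_day_boundary(datetime(2014, 6, 1, 15, 30, tzinfo=timezone.utc)))
# PARSE_UTC_USEC('2014-06-01 00:00:00')
```

Because the generated query text is identical for every run on the same UTC day and no longer calls a non-deterministic function, repeated runs can be served from cached results.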