Self-join in Google BigQuery runs very slowly; am I following best practices?

I am creating a table of the number of overlapping commenters between Reddit subreddits with the following self-join:

SELECT t1.subreddit, t2.subreddit, COUNT(*) as NumOverlaps
FROM [fh-bigquery:reddit_comments.2015_05] t1
JOIN [fh-bigquery:reddit_comments.2015_05] t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit;

Try the following:

SELECT t1.subreddit, t2.subreddit, SUM(t1.cnt*t2.cnt) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit
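
As a rough check of why pre-aggregating helps, the sketch below compares the raw self-join with the pre-aggregated form on a tiny in-memory SQLite table. The table name `comments` and its rows are made up for illustration, SQLite stands in for BigQuery, and the `HAVING cnt > 1` filter is left out so the two queries are directly comparable. It also shows that this sum counts pairs of comments, so it can exceed the number of rows in the table.

```python
import sqlite3

# Hypothetical toy data standing in for fh-bigquery:reddit_comments.2015_05.
rows = (
    [("a", "alice")] * 4      # alice commented 4 times in subreddit a
    + [("b", "alice")] * 3    # and 3 times in subreddit b
    + [("a", "bob"), ("b", "bob"), ("c", "bob")]
    + [("c", "carol")]
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (subreddit TEXT, author TEXT)")
conn.executemany("INSERT INTO comments VALUES (?, ?)", rows)

# Original approach: join every comment with every other comment by the same author.
RAW_JOIN = """
SELECT t1.subreddit, t2.subreddit, COUNT(*) AS NumOverlaps
FROM comments t1
JOIN comments t2 ON t1.author = t2.author
WHERE t1.subreddit < t2.subreddit
GROUP BY t1.subreddit, t2.subreddit
ORDER BY 1, 2
"""

# Pre-aggregated approach: one row per (subreddit, author), then join those rows.
PRE_AGGREGATED = """
WITH per_author AS (
  SELECT subreddit, author, COUNT(1) AS cnt
  FROM comments
  GROUP BY subreddit, author
)
SELECT t1.subreddit, t2.subreddit, SUM(t1.cnt * t2.cnt) AS NumOverlaps
FROM per_author t1
JOIN per_author t2 ON t1.author = t2.author
WHERE t1.subreddit < t2.subreddit
GROUP BY t1.subreddit, t2.subreddit
ORDER BY 1, 2
"""

raw_result = conn.execute(RAW_JOIN).fetchall()
agg_result = conn.execute(PRE_AGGREGATED).fetchall()

print(raw_result)
print(agg_result)
print(len(rows))  # total comments; the (a, b) pair count is larger than this
```

Both queries return the same counts here, but the second one joins far fewer rows, which is presumably where the speedup on the real table comes from.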
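Follow-up: SUM(t1.cnt*t2.cnt) counts pairs of comments rather than authors, so to count overlapping authors you should simply use

    SUM(1) as NumOverlaps

This gives you the same result as using

    EXACT_COUNT_DISTINCT(t1.author) as NumOverlaps

in the original query.

So now try the following: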

SELECT t1.subreddit, t2.subreddit, SUM(1) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit
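
To illustrate the claim that SUM(1) over the pre-aggregated rows matches a distinct-author count on the raw join, here is a minimal sketch on made-up data. It assumes SQLite as a stand-in for BigQuery, a hypothetical table named `comments`, and standard `COUNT(DISTINCT ...)` in place of legacy SQL's `EXACT_COUNT_DISTINCT`. The `HAVING cnt > 1` filter is omitted; with it, only authors with more than one comment in each subreddit would be counted.

```python
import sqlite3

# Hypothetical toy data standing in for fh-bigquery:reddit_comments.2015_05.
rows = (
    [("a", "alice")] * 4
    + [("b", "alice")] * 3
    + [("a", "bob"), ("b", "bob"), ("c", "bob")]
    + [("c", "carol")]
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (subreddit TEXT, author TEXT)")
conn.executemany("INSERT INTO comments VALUES (?, ?)", rows)

# Distinct authors per subreddit pair, computed on the raw (expensive) self-join.
DISTINCT_AUTHORS = """
SELECT t1.subreddit, t2.subreddit, COUNT(DISTINCT t1.author) AS NumOverlaps
FROM comments t1
JOIN comments t2 ON t1.author = t2.author
WHERE t1.subreddit < t2.subreddit
GROUP BY t1.subreddit, t2.subreddit
ORDER BY 1, 2
"""

# After grouping by (subreddit, author), each author appears once per subreddit,
# so every joined row is exactly one shared author and SUM(1) counts authors.
SUM_ONE = """
WITH per_author AS (
  SELECT subreddit, author, COUNT(1) AS cnt
  FROM comments
  GROUP BY subreddit, author
)
SELECT t1.subreddit, t2.subreddit, SUM(1) AS NumOverlaps
FROM per_author t1
JOIN per_author t2 ON t1.author = t2.author
WHERE t1.subreddit < t2.subreddit
GROUP BY t1.subreddit, t2.subreddit
ORDER BY 1, 2
"""

distinct_result = conn.execute(DISTINCT_AUTHORS).fetchall()
sum_one_result = conn.execute(SUM_ONE).fetchall()

print(distinct_result)
print(sum_one_result)
```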

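One question: when I run this code (e.g. with cnt > 5), it reports that the subreddits AskReddit and leagueoflegends have the most overlap, with 225458550104 joint commenters (authors with 5 comments in each subreddit). That seems incorrect, since the table has only 54504410 comments in total? Should it be SUM(t1.cnt+t2.cnt)? Thanks.

See the follow-up in my answer.

:o) Great, thank you! I appreciate the lesson, since I am actually new to SQL.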