Self-join in Google BigQuery runs very slowly. Am I following best practices?

google-bigquery

I am building a table of the number of overlapping commenters between Reddit subreddits using the following self-join:

SELECT t1.subreddit, t2.subreddit, COUNT(*) as NumOverlaps
FROM [fh-bigquery:reddit_comments.2015_05] t1
JOIN [fh-bigquery:reddit_comments.2015_05] t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit;

Try the following:
SELECT t1.subreddit, t2.subreddit, SUM(t1.cnt*t2.cnt) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit
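
The idea behind the query above can be checked on a small scale. The following is a minimal Python sketch, not BigQuery code: the toy `comments` list and both function names are hypothetical, and the `HAVING cnt > 1` filter is left out so the two results can be compared directly. It suggests that grouping by (subreddit, author) before the join feeds far fewer rows into the join, while `SUM(t1.cnt*t2.cnt)` still reproduces the `COUNT(*)` of the raw self-join.

```python
from collections import Counter

# Hypothetical toy data: one (subreddit, author) tuple per comment.
comments = [
    ("AskReddit", "alice"),
    ("AskReddit", "alice"),
    ("AskReddit", "alice"),
    ("AskReddit", "bob"),
    ("leagueoflegends", "alice"),
    ("leagueoflegends", "alice"),
    ("leagueoflegends", "bob"),
    ("pics", "bob"),
]


def raw_self_join(rows):
    """COUNT(*) per subreddit pair when every comment is joined with
    every other comment by the same author (the original query)."""
    overlaps = Counter()
    for sub1, author1 in rows:
        for sub2, author2 in rows:
            if author1 == author2 and sub1 < sub2:
                overlaps[(sub1, sub2)] += 1
    return overlaps


def preaggregated_join(rows):
    """SUM(t1.cnt * t2.cnt) per subreddit pair after first grouping
    the comments by (subreddit, author)."""
    counts = Counter(rows)  # (subreddit, author) -> cnt
    overlaps = Counter()
    for (sub1, author1), cnt1 in counts.items():
        for (sub2, author2), cnt2 in counts.items():
            if author1 == author2 and sub1 < sub2:
                overlaps[(sub1, sub2)] += cnt1 * cnt2
    return overlaps


raw = raw_self_join(comments)
pre = preaggregated_join(comments)
print(len(comments), "rows per join side before grouping")       # 8
print(len(Counter(comments)), "rows per join side after grouping")  # 5
print(raw == pre)  # True
```

On real data the reduction is much larger, because a prolific author contributes one row per subreddit instead of one row per comment.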

Follow-up: you should use a simple
    SUM(1) as NumOverlaps

This gives the same result as using
    EXACT_COUNT_DISTINCT(t1.author) as NumOverlaps

in the original query.
So now try the following:
SELECT t1.subreddit, t2.subreddit, SUM(1) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit
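
To see why `SUM(1)` and `SUM(t1.cnt*t2.cnt)` answer different questions, here is a small Python sketch under stated assumptions: the toy `comments` list and the function names are hypothetical, and the `HAVING` filter is omitted for brevity. The product sum counts pairs of comments, so it can exceed the total number of comments in the table, whereas `SUM(1)` over the grouped rows counts each shared author once per subreddit pair.

```python
from collections import Counter

# Hypothetical toy data: alice comments 4 times in each subreddit, bob once.
comments = (
    [("AskReddit", "alice")] * 4
    + [("leagueoflegends", "alice")] * 4
    + [("AskReddit", "bob"), ("leagueoflegends", "bob")]
)


def comment_pair_counts(rows):
    """SUM(t1.cnt * t2.cnt): number of comment pairs by the same author."""
    counts = Counter(rows)  # (subreddit, author) -> cnt
    overlaps = Counter()
    for (sub1, author1), cnt1 in counts.items():
        for (sub2, author2), cnt2 in counts.items():
            if author1 == author2 and sub1 < sub2:
                overlaps[(sub1, sub2)] += cnt1 * cnt2
    return overlaps


def shared_author_counts(rows):
    """SUM(1): number of distinct authors active in both subreddits."""
    grouped = set(rows)  # one row per (subreddit, author)
    overlaps = Counter()
    for sub1, author1 in grouped:
        for sub2, author2 in grouped:
            if author1 == author2 and sub1 < sub2:
                overlaps[(sub1, sub2)] += 1
    return overlaps


pair = ("AskReddit", "leagueoflegends")
print(len(comments))                         # 10 comments in total
print(comment_pair_counts(comments)[pair])   # 17 = 4*4 + 1*1
print(shared_author_counts(comments)[pair])  # 2 = alice and bob
```

Here the product sum (17) is already larger than the 10 comments in the table, which is the same effect as the implausibly large figure reported for AskReddit and leagueoflegends.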

One question: when I run this code (with cnt > 5, for example), it returns that the subreddits AskReddit and leagueoflegends have the largest overlap, with 225458550104 shared commenters (authors with more than 5 comments in each subreddit). That does not seem right, since the table contains only 54504410 comments in total. Should it be SUM(t1.cnt+t2.cnt)? Thanks!

Please see the follow-up in my answer :o)

Awesome, thanks! I appreciate the lesson, since I am actually new to SQL.