How to find the most common bigrams using BigQuery?

I want to find the most common bigrams (pairs of adjacent words) in my table. How can I do this with BigQuery?

BigQuery now supports SPLIT():

SELECT word, nextword, COUNT(*) c
FROM (
  -- Pair each word with the word that follows it in the same title.
  SELECT pos, title, word,
    LEAD(word) OVER(PARTITION BY created_utc, title ORDER BY pos) nextword
  FROM (
    SELECT created_utc, title, word, pos
    FROM FLATTEN(
      -- SPLIT() yields a repeated field; POSITION() gives each word's index.
      (SELECT created_utc, title, word, POSITION(word) pos FROM
        (SELECT created_utc, title, SPLIT(title, ' ') word FROM [bigquery-samples:reddit.full])
      ), word)
  )
)
WHERE nextword IS NOT NULL
GROUP EACH BY 1, 2
ORDER BY c DESC
LIMIT 100

Standard SQL version:

SELECT word, nextword, COUNT(*) c
FROM (
  -- Pair each word with the word that follows it in the same title.
  SELECT pos, title, word,
    LEAD(word) OVER(PARTITION BY created_utc, title ORDER BY pos) nextword
  FROM (
    SELECT created_utc, title, word, pos
    FROM (
      SELECT created_utc, title, SPLIT(title, ' ') word FROM `bigquery-samples.reddit.full`
    ), UNNEST(word) AS word WITH OFFSET pos
  )
)
WHERE nextword IS NOT NULL
GROUP BY 1, 2
ORDER BY c DESC
LIMIT 100

When unnesting an array, you can retrieve the position of each element with this syntax:

UNNEST(word) as word WITH OFFSET pos
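
To check the logic on a handful of rows before running it against a large table, the same idea can be sketched locally in Python. This is only an illustration under assumptions: `top_bigrams` and `sample_titles` are hypothetical names, not part of BigQuery, and the sketch treats every title as its own partition rather than partitioning by `created_utc, title`.

```python
from collections import Counter


def top_bigrams(titles, limit=100):
    """Count adjacent word pairs across titles (hypothetical helper)."""
    counts = Counter()
    for title in titles:
        words = title.split(" ")  # plays the role of SPLIT(title, ' ')
        # enumerate() plays the role of UNNEST(word) WITH OFFSET pos
        for pos, word in enumerate(words):
            # LEAD(word) is NULL for the last word of a title, so skip it
            if pos + 1 < len(words):
                counts[(word, words[pos + 1])] += 1
    # plays the role of GROUP BY 1, 2 ORDER BY c DESC LIMIT n
    return counts.most_common(limit)


# Made-up sample data, for illustration only.
sample_titles = ["the quick brown fox", "the quick red fox", "a quick brown dog"]
result = top_bigrams(sample_titles, limit=3)
for (word, nextword), c in result:
    print(word, nextword, c)
```

The pair of `pos` and `pos + 1` is what the window function expresses with `ORDER BY pos`: without the offset, the order of the words inside a title would not be available once the array is unnested.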

There is now a new function for this, ML.NGRAMS:

WITH data AS (
  SELECT REGEXP_EXTRACT_ALL(LOWER(title), '[a-z]+') title_arr
  FROM `fh-bigquery.reddit_posts.2019_08`
  WHERE title LIKE '% %'
  AND score > 1
)

SELECT APPROX_TOP_COUNT(bigram, 10) top
FROM (
  SELECT ML.NGRAMS(title_arr, [2,2]) x
  FROM data
), UNNEST(x) bigram
WHERE LENGTH(bigram) > 10
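
For readers who want to see what the n-gram step produces, here is a rough local Python sketch of the same pipeline. It rests on assumptions: `ngrams` is a hypothetical stand-in for `ML.NGRAMS` with a `[min, max]` range and a space separator, the order of its output is not claimed to match BigQuery's, the titles are made up, and `Counter.most_common` gives exact counts whereas `APPROX_TOP_COUNT` is approximate.

```python
import re
from collections import Counter


def ngrams(words, min_n, max_n, separator=" "):
    """Return every n-gram with min_n <= n <= max_n as a joined string.

    Hypothetical stand-in for ML.NGRAMS(words, [min_n, max_n]).
    """
    out = []
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            out.append(separator.join(words[i:i + n]))
    return out


# Made-up sample data, for illustration only.
titles = ["Machine learning basics", "Deep machine learning", "Learn to cook"]

counts = Counter()
for title in titles:
    # plays the role of REGEXP_EXTRACT_ALL(LOWER(title), '[a-z]+')
    words = re.findall(r"[a-z]+", title.lower())
    for bigram in ngrams(words, 2, 2):
        if len(bigram) > 10:  # plays the role of WHERE LENGTH(bigram) > 10
            counts[bigram] += 1

top = counts.most_common(10)
print(top)
```

Passing the same number twice as the range, as in `[2,2]`, restricts the output to bigrams; a wider range such as 1 to 3 would also return single words and trigrams.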


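Beautiful! Where can I find some documentation?
Coming soon. I just couldn't wait to share the news :)
@FelipeHoffa This uses legacy SQL. Any chance you could update it to standard SQL?
Check out the new ML.NGRAMS function in my new answer below.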