How can I find the most common bigrams with BigQuery?
I want to find the most common bigrams (pairs of adjacent words) in my table. How can I do this with BigQuery?
BigQuery now supports SPLIT():
SELECT word, nextword, COUNT(*) c
FROM (
  SELECT pos, title, word, LEAD(word) OVER(PARTITION BY created_utc, title ORDER BY pos) nextword
  FROM (
    SELECT created_utc, title, word, pos
    FROM FLATTEN(
      (SELECT created_utc, title, word, POSITION(word) pos
       FROM (SELECT created_utc, title, SPLIT(title, ' ') word FROM [bigquery-samples:reddit.full])
      ), word)
  )
)
WHERE nextword IS NOT null
GROUP EACH BY 1, 2
ORDER BY c DESC
LIMIT 100
Standard SQL version:
SELECT word, nextword, COUNT(*) c
FROM (
  SELECT pos, title, word, LEAD(word) OVER(PARTITION BY created_utc, title ORDER BY pos) nextword
  FROM (
    SELECT created_utc, title, word, pos
    FROM (
      SELECT created_utc, title, SPLIT(title, ' ') word
      FROM `bigquery-samples.reddit.full`
    ), UNNEST(word) AS word WITH OFFSET pos
  )
)
WHERE nextword IS NOT null
GROUP BY 1, 2
ORDER BY c DESC
LIMIT 100
When unnesting an array, you can retrieve each element's position with the following syntax:
UNNEST(word) as word WITH OFFSET pos
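To see what WITH OFFSET produces, here is a minimal sketch that runs on a literal string instead of the reddit table (the sample sentence is made up for illustration). Each word comes back with its zero-based position, and LEAD over that position picks up the following word:

```sql
-- Standard SQL; needs no table. The input string is an illustrative assumption.
SELECT
  word,
  pos,
  -- the next word in the same string; NULL for the last word
  LEAD(word) OVER (ORDER BY pos) AS nextword
FROM UNNEST(SPLIT('the quick brown fox', ' ')) AS word WITH OFFSET AS pos
ORDER BY pos
```

The last row has a NULL nextword, which is why the full query filters with WHERE nextword IS NOT null before counting.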
Now with a new function, ML.NGRAMS:
WITH data AS (
  SELECT REGEXP_EXTRACT_ALL(LOWER(title), '[a-z]+') title_arr
  FROM `fh-bigquery.reddit_posts.2019_08`
  WHERE title LIKE '% %'
  AND score>1
)
SELECT APPROX_TOP_COUNT(bigram, 10) top
FROM (
  SELECT ML.NGRAMS(title_arr, [2,2]) x
  FROM data
), UNNEST(x) bigram
WHERE LENGTH(bigram) > 10
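Before running ML.NGRAMS on a full table, a small sketch like the following can help check what it returns. It uses a literal array (the words are illustrative, not from the dataset). The second argument is the range of n-gram sizes, so [2, 2] asks for bigrams only. The tokens are joined with a space by default.

```sql
-- Standard SQL; needs no table. The word list is an illustrative assumption.
SELECT bigram
FROM UNNEST(ML.NGRAMS(['the', 'quick', 'brown', 'fox'], [2, 2])) AS bigram
```

This should yield one row per adjacent pair ('the quick', 'quick brown', 'brown fox'). Because each bigram is a single string, it can be passed straight to an aggregate such as APPROX_TOP_COUNT.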
Documentation: see the ML.NGRAMS entry in the BigQuery ML reference.
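Beautiful! Where can I find some documentation?
Coming soon. I couldn't wait to share the news :)
@FelipeHoffa this is using Legacy SQL; any chance you could update it with Standard SQL?
Check out the new ML.NGRAMS function in my new answer.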