BigQuery running total


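I'm having trouble computing a running total in BigQuery.

I found a working example here:

But what I really want is the number of most popular words that together account for 80% of the total word count. So I first tried to compute a running total while ordering by word_count: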

SELECT word, word_count, SUM(word_count) OVER(ORDER BY word_count DESC)
FROM [publicdata:samples.shakespeare]
WHERE corpus  = 'hamlet'
AND word > 'a' LIMIT 30
But I get:

Row word    word_count  f0_
1   o'er    18          18
2   answer  13          31
3   meet    8           39
4   told    5           44
5   treason 4           52
6   quality 4           52
7   brave   3           55
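The running total does not increase from row 5 to row 6, probably because word_count is 4 in both rows.

What am I doing wrong?

Maybe there is a better way? My plan is to compute the running total, divide it by SUM(word_count) OVER(), keep only the rows below 80%, and then count those rows.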

First, remove the "LIMIT 30": it will interfere with the OVER() clause.

Do you want a ratio? Try RATIO_TO_REPORT:

SELECT word, word_count, RATIO_TO_REPORT(word_count) OVER(ORDER BY word_count DESC)
FROM [publicdata:samples.shakespeare]
WHERE corpus  = 'hamlet'
AND word > 'a' 
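Do you want consecutive rows with the same value to keep increasing anyway? Decide an order for those rows by adding a secondary sort key: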

SELECT word, word_count, RATIO_TO_REPORT(word_count) OVER(ORDER BY word_count DESC, word)
FROM [publicdata:samples.shakespeare]
WHERE corpus  = 'hamlet'
AND word > 'a' 
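
To see why the secondary sort key matters, the tie behaviour from the question can be imitated outside BigQuery. The Python sketch below is not BigQuery code; it uses the seven rows from the question's output, and the function name `running_total` is hypothetical. It assumes that rows with an equal sort key are treated as peers and all receive the sum up to the last peer, which is what the question's output (52 and 52) shows.

```python
from itertools import groupby

# (word, word_count) rows taken from the output shown in the question.
ROWS = [
    ("o'er", 18), ("answer", 13), ("meet", 8), ("told", 5),
    ("treason", 4), ("quality", 4), ("brave", 3),
]

def running_total(rows, key):
    """Running sum of word_count where rows with an equal `key` are peers.

    Peers all receive the total up to and including the last peer,
    mirroring the behaviour seen in the question's output.
    """
    ordered = sorted(rows, key=key)
    result = []
    total = 0
    for _, peers in groupby(ordered, key=key):
        peers = list(peers)
        total += sum(count for _, count in peers)
        result.extend((word, count, total) for word, count in peers)
    return result

# Ordering by word_count DESC only: 'treason' and 'quality' tie.
by_count = running_total(ROWS, key=lambda r: -r[1])

# Ordering by word_count DESC, word: every row has a distinct key.
by_count_and_word = running_total(ROWS, key=lambda r: (-r[1], r[0]))

for row in by_count_and_word:
    print(row)
```

With the tie-break in place, the two tied words get running totals of 48 and 52 instead of 52 and 52.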
Do you want the most popular words that cover 80%? Sum those ratios and filter out the rest:

SELECT word, word_count, sum_ratio
FROM (
 SELECT word, word_count, SUM(ratio) OVER(ORDER BY ratio, word) sum_ratio
 FROM (
    SELECT word, word_count, RATIO_TO_REPORT(word_count) OVER(ORDER BY word_count DESC, word) ratio
    FROM [publicdata:samples.shakespeare]
    WHERE corpus  = 'hamlet'
    AND word > 'a' 
 )
)
WHERE sum_ratio>0.8

Row word    word_count  sum_ratio    
1   is      313         0.8125175752219499   
2   it      361         0.827019644076648    
3   in      400         0.8430884184308841   
4   my      441         0.8608042421564295   
5   you     499         0.8808500381633391   
6   of      630         0.906158357771261    
7   to      635         0.9316675370586108   
8   and     706         0.9600289237938375   
9   the     995         0.9999999999999999  
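
Note that the inner SUM in the query above accumulates in ascending order of ratio, so the rows it returns are the most frequent words that sit above the 0.8 cumulative mark. The plan described in the question goes the other way: accumulate from the most frequent word down and count how many words are needed to reach 80% of the total. A minimal sketch of that plan in plain Python follows; the function name `words_covering` and the sample counts are made up for illustration and are not taken from the Shakespeare table.

```python
def words_covering(counts, share=0.8):
    """Return the most frequent words whose counts together reach `share`
    of the total, accumulating from the largest count down.

    The word that crosses the threshold is included in the result.
    """
    total = sum(counts.values())
    # Sort by count descending, then by word so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    selected = []
    running = 0
    for word, count in ordered:
        if running >= share * total:
            break
        running += count
        selected.append(word)
    return selected

# Hypothetical sample data, not taken from the Shakespeare table.
sample = {"the": 50, "and": 20, "to": 12, "of": 8, "my": 5, "you": 5}
top = words_covering(sample)
print(len(top), top)
```

Sorting by the word as a secondary key plays the same role here as the secondary ORDER BY key in the queries: it makes the order of tied counts well defined.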

Thanks! That was a real lesson in window functions.