Python: counting the frequency of three-word phrases


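I have the code below, which finds the frequency of two-word phrases. I need to do the same for three-word phrases.

However, the code below does not seem to work for three-word phrases.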

from collections import Counter
import re

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = re.findall(r'\w+', sentence)
two_words = [' '.join(ws) for ws in zip(words, words[1:])]
wordscount = {w:f for w, f in Counter(two_words).most_common() if f > 1}
wordscount
{'show makes': 2, 'makes me': 2, 'I love': 2}
Try zip(words, words[1:], words[2:]).

For example:

from collections import Counter
import re

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = re.findall(r'\w+', sentence)

three_words = [' '.join(ws) for ws in zip(words, words[1:], words[2:])]
wordscount = {w:f for w, f in Counter(three_words).most_common() if f > 1}
print( wordscount )

Output:

{'show makes me': 2}

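I suggest factoring the windowing logic out into a helper, nwise(words, n), that yields each run of n consecutive words as a tuple.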
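The definition of nwise does not appear in this answer. The following is a minimal sketch of one way it could be written, assuming it takes any iterable and a window size n and yields tuples; the tee/islice approach is my assumption, not necessarily what the answer's author used.

```python
from itertools import islice, tee

def nwise(iterable, n):
    """Yield every run of n consecutive items from iterable as a tuple."""
    # Make n independent iterators over the same data, then advance the
    # i-th one by i items so that zipping them lines up items i .. i+n-1.
    iterators = tee(iterable, n)
    shifted = [islice(it, i, None) for i, it in enumerate(iterators)]
    return zip(*shifted)

words = "I love TV show makes me".split()
print(list(nwise(words, 3)))
# [('I', 'love', 'TV'), ('love', 'TV', 'show'), ('TV', 'show', 'makes'), ('show', 'makes', 'me')]
```

Because zip stops at the shortest iterator, an input with fewer than n items simply yields nothing.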

Then you can write

two_words = [" ".join(bigram) for bigram in nwise(words, 2)]

and so on. You can then use collections.Counter on top of that:

three_word_counts = Counter(" ".join(trigram) for trigram in nwise(words, 3))

You can use collections.Counter on a sequence of three-word windows. The windows are built with a generator expression and list slicing.

from collections import Counter

three_words = (words[i:i+3] for i in range(len(words)-2))
counts = Counter(map(tuple, three_words))
wordscount = {' '.join(word): freq for word, freq in counts.items() if freq > 1}

print(wordscount)

{'show makes me': 2}
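Note that str.join is applied only at the very end, to avoid unnecessary repeated string operations. The tuple conversion is also required for Counter, because dict keys must be hashable.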
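To illustrate the hashability point, here is a small sketch (the variable names are only for illustration) showing that Counter rejects a list as a key but accepts the same window as a tuple:

```python
from collections import Counter

window = ["show", "makes", "me"]

# A list is mutable and therefore unhashable, so Counter cannot use it as a key.
try:
    Counter([window])
    list_ok = True
except TypeError:
    list_ok = False

# The same window converted to a tuple is hashable and is counted normally.
counts = Counter([tuple(window), tuple(window)])

print(list_ok)  # False
print(counts)   # Counter({('show', 'makes', 'me'): 2})
```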

What about:

from collections import Counter

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = sentence.split()
# range(len(words) - 2) so that the final three-word window is included
r = Counter([' '.join(words[i:i+3]) for i in range(len(words) - 2)])

>>> r.most_common()[0]  # the most common three-word phrase
('show makes me', 2)

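This is good. I just think the expensive str.join should be deferred until after the final filtering on the minimum count.

@jpp I doubt that would be a problem, but you could also feed nwise(words, 3) directly into the Counter and str.join as needed.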
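The idea of counting tuples first and joining only the phrases that pass the filter could look roughly like this. It is a sketch under assumptions: repeated_phrases and its min_count parameter are hypothetical names, and nwise here is a simple list-only stand-in built from shifted slices, since the helper's definition is not given in the thread.

```python
from collections import Counter
import re

def nwise(words, n):
    # List-only stand-in for the nwise helper: zip n progressively shifted slices.
    return zip(*(words[i:] for i in range(n)))

def repeated_phrases(sentence, n, min_count=2):
    """Return {phrase: count} for n-word phrases seen at least min_count times."""
    words = re.findall(r'\w+', sentence)
    counts = Counter(nwise(words, n))  # keys are tuples of words, not strings
    # str.join runs only for the phrases that survive the count filter.
    return {' '.join(gram): freq
            for gram, freq in counts.most_common()
            if freq >= min_count}

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
print(repeated_phrases(sentence, 3))  # {'show makes me': 2}
print(repeated_phrases(sentence, 2))
```

The same function covers both the two-word and three-word cases from the question by changing n.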