Counting the frequency of three-word phrases in Python

python string python-3.x

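I have the code below to find the frequency of two-word phrases. I need to do the same exercise for three-word phrases.

However, the code below does not seem to work for three-word phrases.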

from collections import Counter
import re

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = re.findall(r'\w+', sentence)
two_words = [' '.join(ws) for ws in zip(words, words[1:])]
wordscount = {w:f for w, f in Counter(two_words).most_common() if f > 1}
wordscount
{'show makes': 2, 'makes me': 2, 'I love': 2}

Try

zip(words, words[1:], words[2:])

Ex:

from collections import Counter
import re

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = re.findall(r'\w+', sentence)

three_words = [' '.join(ws) for ws in zip(words, words[1:], words[2:])]
wordscount = {w:f for w, f in Counter(three_words).most_common() if f > 1}
print( wordscount )

Output:

{'show makes me': 2}
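
The staggered-slice pattern used with zip extends to any phrase length. A minimal sketch, assuming a hypothetical helper named ngram_counts (it is not part of the answer above) that builds the n shifted slices programmatically instead of writing them out by hand:

```python
from collections import Counter
import re

def ngram_counts(sentence, n):
    """Count every run of n consecutive words in sentence (hypothetical helper)."""
    words = re.findall(r'\w+', sentence)
    # words[0:], words[1:], ..., words[n-1:]; zip stops at the shortest slice,
    # so only complete n-word windows are produced.
    shifted = [words[i:] for i in range(n)]
    return Counter(' '.join(ws) for ws in zip(*shifted))

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
repeated = {w: f for w, f in ngram_counts(sentence, 3).items() if f > 1}
print(repeated)
```

With n=2 the same helper should reproduce the two-word counts from the question.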

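I would suggest factoring the windowing logic out into a helper function, nwise(iterable, n), that yields consecutive n-tuples.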
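
The definition of nwise is not shown in this answer, so the following is only one possible way to write it: a minimal sketch, assuming it takes any iterable and a window size n and yields overlapping n-tuples. It uses itertools.tee and itertools.islice so that it also works on iterators, not just lists.

```python
from itertools import islice, tee

def nwise(iterable, n):
    """Yield overlapping n-tuples of consecutive items (assumed behaviour).

    nwise("abcd", 2) yields ('a', 'b'), ('b', 'c'), ('c', 'd').
    """
    # Make n independent iterators over the same data.
    copies = tee(iterable, n)
    # Skip the first i items of the i-th copy so the copies are staggered.
    staggered = [islice(it, i, None) for i, it in enumerate(copies)]
    # zip stops as soon as the most-advanced copy is exhausted.
    return zip(*staggered)
```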

Then you can do

two_words = [" ".join(bigram) for bigram in nwise(words, 2)]

and the same with nwise(words, 3) for three-word phrases, and so on. You can then use collections.Counter on top of that:

three_word_counts = Counter(" ".join(trigram) for trigram in nwise(words, 3))

You can use collections.Counter on an iterable of three-word lists. The latter is built with a generator expression and list slicing:

from collections import Counter

three_words = (words[i:i+3] for i in range(len(words)-2))
counts = Counter(map(tuple, three_words))
wordscount = {' '.join(word): freq for word, freq in counts.items() if freq > 1}

print(wordscount)

{'show makes me': 2}

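Note that str.join is applied only at the very end, to avoid unnecessary repeated string operations. The tuple conversion is needed for Counter because dict keys must be hashable.
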
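The hashability point can be checked directly. A small illustration, using a shortened word list invented for this example: passing list slices to Counter raises TypeError, while the same windows converted to tuples are counted without trouble.

```python
from collections import Counter

# Shortened, made-up word list for illustration only.
words = ["show", "makes", "me", "show", "makes", "me"]

# Lists are mutable and therefore unhashable, so Counter cannot use them as keys.
try:
    Counter(words[i:i+3] for i in range(len(words) - 2))
    list_keys_ok = True
except TypeError:
    list_keys_ok = False

# Tuples are hashable, so the same windows can be counted once converted.
counts = Counter(tuple(words[i:i+3]) for i in range(len(words) - 2))

print(list_keys_ok)
print(counts[("show", "makes", "me")])
```
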
How about:
from collections import Counter

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = sentence.split()
r = Counter([' '.join(words[i:i+3]) for i in range(len(words)-2)])

>>> r.most_common()[0] #get the most common 3-words
('show makes me', 2)

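This is nice. I just think the expensive str.join should be deferred until after the final filtering by minimum count.
@jpp I doubt that would be a problem, but you could also feed nwise(words, 3) straight into Counter and str.join as needed.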
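
The suggestion in these comments can be sketched as follows. This is an illustration under assumptions, not code from the thread: frequent_ngrams and min_count are hypothetical names, and zip over shifted slices stands in for nwise since its definition is not shown. The word tuples go into Counter as they are, and str.join runs only on the phrases that pass the count filter.

```python
from collections import Counter

def frequent_ngrams(words, n, min_count=2):
    """Hypothetical helper: count n-word windows as tuples, join only the survivors."""
    # zip over shifted slices yields tuples, which are hashable Counter keys.
    windows = zip(*(words[i:] for i in range(n)))
    counts = Counter(windows)
    # str.join is called once per phrase that meets the threshold, not once per window.
    return {' '.join(gram): freq for gram, freq in counts.items() if freq >= min_count}

words = "I love TV show makes me happy I love also comedy show makes me".split()
print(frequent_ngrams(words, 3))
```

For short inputs like this one the saving is negligible; the difference would only be expected to matter on large texts where most windows occur once.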