Python: counting the frequency of three-word phrases
I have the code below to find the frequency of two-word phrases. I need to do the same for three-word phrases, but the code below does not seem to work for them.
from collections import Counter
import re
sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = re.findall(r'\w+', sentence)
two_words = [' '.join(ws) for ws in zip(words, words[1:])]
wordscount = {w:f for w, f in Counter(two_words).most_common() if f > 1}
wordscount
{'show makes': 2, 'makes me': 2, 'I love': 2}
Try zip(words, words[1:], words[2:])
Ex:
from collections import Counter
import re
sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = re.findall(r'\w+', sentence)
three_words = [' '.join(ws) for ws in zip(words, words[1:], words[2:])]
wordscount = {w:f for w, f in Counter(three_words).most_common() if f > 1}
print( wordscount )
Output:
{'show makes me': 2}
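The zip approach above extends to any phrase length by zipping n shifted copies of the word list. The following is a minimal sketch of that idea; the helper name `repeated_ngrams` and its `min_count` parameter are illustrative assumptions, not part of the original answer.

```python
from collections import Counter
import re


def repeated_ngrams(sentence, n, min_count=2):
    """Count n-word phrases, keeping those seen at least `min_count` times.

    Hypothetical helper written for illustration only.
    """
    words = re.findall(r'\w+', sentence)
    # Build n shifted views of the word list: words, words[1:], ..., words[n-1:]
    shifted = [words[i:] for i in range(n)]
    # zip stops at the shortest view, so only complete n-word windows are produced
    phrases = (' '.join(ws) for ws in zip(*shifted))
    return {p: c for p, c in Counter(phrases).most_common() if c >= min_count}


sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
print(repeated_ngrams(sentence, 2))
print(repeated_ngrams(sentence, 3))
```

With n=2 this reproduces the two-word result from the question, and with n=3 the three-word result shown above.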
I suggest breaking the functionality out into a helper function, nwise(iterable, n). Then you can write
two_words = [" ".join(bigram) for bigram in nwise(words, 2)]
and likewise with nwise(words, 3) for three-word phrases, and so on.
You can then use collections.Counter on top of this:
three_word_counts = Counter(" ".join(trigram) for trigram in nwise(words, 3))
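The definition of nwise does not appear in the text above. One possible definition is sketched below, assuming nwise(iterable, n) yields consecutive overlapping n-tuples; the body, built on itertools.tee and itertools.islice, is an assumption and may differ from what the answer's author had in mind.

```python
from collections import Counter
from itertools import islice, tee


def nwise(iterable, n):
    """Yield consecutive overlapping n-tuples from `iterable` (assumed behaviour)."""
    # tee makes n independent iterators; the i-th one skips its first i items
    iterators = (islice(it, i, None) for i, it in enumerate(tee(iterable, n)))
    # zip stops as soon as the most-advanced iterator is exhausted
    return zip(*iterators)


words = "show makes me happy show makes me".split()
three_word_counts = Counter(" ".join(trigram) for trigram in nwise(words, 3))
print(three_word_counts.most_common(1))
```

Unlike slicing, this version also works on iterators that cannot be indexed, such as a generator of words read from a file.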
You can use collections.Counter on an iterable of three-word lists. The latter is built with a generator expression and list slicing.
from collections import Counter
three_words = (words[i:i+3] for i in range(len(words)-2))
counts = Counter(map(tuple, three_words))
wordscount = {' '.join(word): freq for word, freq in counts.items() if freq > 1}
print(wordscount)
{'show makes me': 2}
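Note that str.join is applied only at the very end, to avoid unnecessary repeated string operations. The tuple conversion is also required, because Counter keys, like dict keys, must be hashable.
How about: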
from collections import Counter
sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = sentence.split()
# range(len(words) - 2) covers every three-word window, including the last one
r = Counter([' '.join(words[i:i+3]) for i in range(len(words)-2)])
>>> r.most_common()[0]  # get the most common three-word phrase
('show makes me', 2)
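This is nice. I just feel that the expensive str.join should be deferred until after the final filtering by minimum count.
@jpp I doubt it would be a problem, but you could also feed nwise(words, 3) directly into Counter and call str.join only where needed.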
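The suggestion in the last comment can be sketched as follows: count the trigrams as tuples and join only those that survive the filter. Since nwise is not defined in this thread, the sketch uses zip(words, words[1:], words[2:]) in its place; the `min_count` name is an illustrative assumption.

```python
from collections import Counter
import re

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = re.findall(r'\w+', sentence)

# Count trigrams as tuples; tuples are hashable, so no string is built at this stage
trigram_counts = Counter(zip(words, words[1:], words[2:]))

# Assumed threshold, equivalent to the "f > 1" filter used earlier in the thread
min_count = 2

# str.join runs only for the trigrams that pass the minimum-count filter
repeated = {' '.join(t): c for t, c in trigram_counts.items() if c >= min_count}
print(repeated)
```

For a sentence this short the saving is negligible; the difference would only matter on large inputs where most phrases occur once and are filtered out.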