Counting phrase frequency in Python 3.3.2

I have been looking through various sources on the web and have tried a number of approaches, but I could only find out how to count the frequency of unique words, not unique phrases. The code I have so far is as follows:

import collections
import re

wanted = set(['inflation', 'gold', 'bank'])
cnt = collections.Counter()
words = re.findall(r'\w+', open('02.2003.BenBernanke.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)

If possible, I would also like to count the number of times the phrases "central bank" and "high inflation" are used in this text. Any advice or guidance is greatly appreciated.

First, here is how I would generate the cnt you built (to keep the memory overhead down):

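A minimal sketch of a findWords generator along those lines, assuming it reads the file line by line and uses yield from (mentioned in the comments below) to hand the words back lazily; the Counter is then filled in a single pass:

import collections
import re

def findWords(filepath):
    # Yield words one at a time instead of building the whole list in memory.
    with open(filepath) as infile:
        for line in infile:
            yield from re.findall(r'\w+', line.lower())

wanted = set(['inflation', 'gold', 'bank'])
cnt = collections.Counter(
    word for word in findWords('02.2003.BenBernanke.txt') if word in wanted)
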
Now, for the question about phrases:

from itertools import tee

phrases = {'central bank', 'high inflation'}
fw1, fw2 = tee(findWords('02.2003.BenBernanke.txt'))
next(fw2)  # advance the second copy so each pair is two consecutive words
for w1, w2 in zip(fw1, fw2):
    phrase = ' '.join([w1, w2])
    if phrase in phrases:
        cnt[phrase] += 1

Hope this helps.

Assuming the file is not very large, this is the simplest way:

# "words" and "cnt" are the ones from your snippet; add the phrases you care about to "wanted"
for w1, w2 in zip(words, words[1:]):
    phrase = w1 + " " + w2
    if phrase in wanted:
        cnt[phrase] += 1
print(cnt)

To count literal occurrences of a couple of phrases in a small file:

with open("input_text.txt") as file:
    text = file.read()
n = text.count("high inflation rate")
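
The same idea extends to several phrases at once; a small sketch, using the phrases from the question:

# str.count() returns the number of non-overlapping literal occurrences
phrases = ["central bank", "high inflation"]
counts = {p: text.count(p) for p in phrases}
print(counts)
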
The nltk module provides tools for identifying words that frequently appear next to each other:

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder

# run nltk.download() if there are files missing
words = [word.casefold() for sentence in sent_tokenize(text)
         for word in word_tokenize(sentence)]
words_fd = nltk.FreqDist(words)
bigram_fd = nltk.FreqDist(nltk.bigrams(words))
finder = BigramCollocationFinder(words_fd, bigram_fd)
bigram_measures = nltk.collocations.BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 5))
print(finder.score_ngrams(bigram_measures.raw_freq))

# finder can be constructed from words directly
finder = TrigramCollocationFinder.from_words(words)
# filter words
finder.apply_word_filter(lambda w: w not in wanted)
# top n results
trigram_measures = nltk.collocations.TrigramAssocMeasures()
print(sorted(finder.nbest(trigram_measures.raw_freq, 2)))
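
If all you need are raw counts for the specific two-word phrases from the question, the bigram frequency distribution built above can be queried directly; a small sketch:

# bigram_fd maps (word1, word2) tuples to counts; a pair that never occurs counts as 0
for pair in [('central', 'bank'), ('high', 'inflation')]:
    print(' '.join(pair), bigram_fd[pair])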

In Python 3.3 you can use yield from.

phrase becomes True/False, so "phrase in phrases" always yields False; this code does not produce the result the OP wants. Try the code with "central bank high inflation" as the file content and "central bank", "high inflation" as the phrases. You may want to use something like itertools.tee; see the pairwise recipe in the itertools documentation. @falsetru: thanks for the bug report and the yield from note. Please let me know whether the changes help. Now you are processing the file twice. Let me edit it to use tee.

Do you just want the frequencies of the words you are looking for in the text? @J.F.Sebastian, to some extent, but specifically the frequency of phrases such as "high inflation rate". Related:

Hey gnibbler, thanks for the great insight! However, when I merge this part of the code with the first snippet above, it returns an error saying that "words" is not recognized. Do you know why? Thanks again for your help.

"words" is just the list of words from your question; the for loop combines pairs of words to create (two-word) phrases.
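
For reference, a minimal sketch of how the two snippets can be merged, with the file name and word list taken from the question and the two phrases assumed from it:

import collections
import re

wanted = set(['inflation', 'gold', 'bank'])
phrases = {'central bank', 'high inflation'}  # assumed from the question
cnt = collections.Counter()

# the same word list as in the question's snippet
words = re.findall(r'\w+', open('02.2003.BenBernanke.txt').read().lower())

for word in words:
    if word in wanted:
        cnt[word] += 1

# pair consecutive words into two-word phrases, as in the answer above
for w1, w2 in zip(words, words[1:]):
    phrase = w1 + " " + w2
    if phrase in phrases:
        cnt[phrase] += 1

print(cnt)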