Problem with the Python NLTK bigram collocation finder

python nlp

I have a text file named "all.txt" that contains an ordinary English paragraph.

For some reason, when I run this code:

    import nltk
    from nltk.collocations import *
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    trigram_measures = nltk.collocations.TrigramAssocMeasures()

    # change this to read in your data
    finder = BigramCollocationFinder.from_words(('all.txt'))

    # only bigrams that appear 3+ times
    #finder.apply_freq_filter(3)

    # return the 10 n-grams with the highest PMI
    print finder.nbest(bigram_measures.pmi, 10)

I get the following result:

       [('.', 't'), ('a', 'l'), ('l', '.'), ('t', 'x'), ('x', 't')]

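Since I am only getting letters back, what am I doing wrong? I am looking for words, not letters.

Here is a sample from "all.txt", so you can see what is being processed:

It is not only Democrats who oppose this plan. Americans across the country have voiced their opposition to it. My Democratic colleagues and I have a better plan that will strengthen ethics rules to improve congressional accountability and ensure that legislation is properly considered. The Republican plan fails to close a loophole that allows legislation to be considered before members have read it."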

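The first problem is that you never actually read the file: you pass a string containing the file path to the function, and because a string is iterated one character at a time, the finder pairs up the letters of "all.txt". The second problem is that you need to run the text through a tokenizer first. To fix the second problem: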

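    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
    from nltk.tokenize import word_tokenize

    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(word_tokenize("This is a test sentence"))
    print(finder.nbest(bigram_measures.pmi, 10))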

which produces

    [('This', 'is'), ('a', 'test'), ('is', 'a'), ('test', 'sentence')]
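To see why the original call returned letter pairs, here is a minimal pure-Python sketch of what any bigram builder does with its input. The helper `adjacent_pairs` is hypothetical (it is not part of NLTK), and `str.split()` is used only as a stand-in for a real tokenizer.

```python
def adjacent_pairs(items):
    """Return consecutive pairs from any iterable (hypothetical helper, not an NLTK function)."""
    items = list(items)
    return list(zip(items, items[1:]))


# A bare string is iterated one character at a time, so the pairs are letters.
char_pairs = adjacent_pairs("all.txt")

# A list of tokens is iterated one word at a time, so the pairs are words.
# Assumption: str.split() stands in for a proper tokenizer here.
word_pairs = adjacent_pairs("This is a test sentence".split())

print(char_pairs)
print(word_pairs)
```

The first list contains pairs such as `('a', 'l')` and `('x', 't')`, matching the kind of output in the question, while the second contains word pairs such as `('This', 'is')`.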

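Note that you may want to use a different tokenizer; the documentation for the tokenize package explains the various options in more detail.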
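As one possible alternative, the sketch below uses `RegexpTokenizer` from `nltk.tokenize`. The pattern `\w+` is an assumption chosen for illustration: it keeps runs of word characters and drops punctuation, which may or may not suit your text.

```python
from nltk.tokenize import RegexpTokenizer

# Assumption: keeping only runs of word characters is acceptable for this text.
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize("This is a test sentence.")

print(tokens)
```

With this pattern the trailing period is discarded, so punctuation never ends up inside a bigram.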

For the first problem, you can use something like the following:

    with open('all.txt', 'r') as data_file:
        finder = BigramCollocationFinder.from_words(word_tokenize(data_file.read()))
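Putting both fixes together, a rough end-to-end sketch might look like this. The function name `top_bigrams` and its parameters are hypothetical, and `RegexpTokenizer` is swapped in for `word_tokenize` here only so the example runs without downloading tokenizer data; substitute whichever tokenizer fits your text.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import RegexpTokenizer


def top_bigrams(path, n=10, min_freq=1):
    """Read a text file, tokenize it, and return up to n bigrams ranked by PMI.

    Hypothetical helper for illustration; not part of NLTK.
    """
    # Fix 1: actually read the file contents instead of passing the path string.
    with open(path, "r") as data_file:
        text = data_file.read()

    # Fix 2: split the text into word tokens before building bigrams.
    # Assumption: a simple word-character tokenizer stands in for word_tokenize.
    tokens = RegexpTokenizer(r"\w+").tokenize(text)

    finder = BigramCollocationFinder.from_words(tokens)
    # Drop bigrams seen fewer than min_freq times (the filter commented out in the question).
    finder.apply_freq_filter(min_freq)

    bigram_measures = BigramAssocMeasures()
    return finder.nbest(bigram_measures.pmi, n)
```

Calling `top_bigrams('all.txt')` should then return pairs of words rather than pairs of letters. Raising `min_freq` is worth considering on longer texts, since PMI tends to favor bigrams that occur only once.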

Upvoted to offset the unexplained downvote.