Importing and using a corpus with NLTK in Python


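Please, please, please help. I have a folder full of text files that I want to analyze with NLTK. How do I import it as a corpus and then run NLTK commands on it? I put together the code below, but it gives me this error: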

    raise error, v # invalid expression
sre_constants.error: nothing to repeat

Code:

import nltk
import re
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

corpus_root = '/Users/jt/Documents/Python/CRspeeches'
speeches = PlaintextCorpusReader(corpus_root, '*.txt')

print "Finished importing corpus" 

words = FreqDist()

for sentence in speeches.sents():
    for word in sentence:
        words.inc(word.lower())

print words["he"]
print words.freq("he")

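As I understand it, this problem is related to a known bug (or perhaps a feature?) that has been partly explained elsewhere. In short, certain regular expressions that try to repeat nothing blow up.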

The error comes from your `speeches = ...` line. You should change it to the following:

speeches = PlaintextCorpusReader(corpus_root, r'.*\.txt')

Then everything loads and compiles just fine.
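For readers on Python 3 and NLTK 3, the rest of the question's script also needs small changes: `print` is a function, `FreqDist` has to be imported, and `FreqDist.inc()` no longer exists. The following is a minimal sketch under those assumptions, not the original poster's code. The helper names `load_corpus` and `word_frequencies` and the two-file temporary corpus are hypothetical, and it counts tokens from `words()` rather than `sents()` so that no extra tokenizer data is needed.

```python
import os
import tempfile

from nltk import FreqDist
from nltk.corpus.reader.plaintext import PlaintextCorpusReader


def load_corpus(corpus_root):
    """Build a reader over every .txt file under corpus_root."""
    # The second argument is a regular expression, not a shell glob.
    return PlaintextCorpusReader(corpus_root, r'.*\.txt')


def word_frequencies(reader):
    """Count lower-cased tokens; FreqDist behaves like a Counter in NLTK 3."""
    return FreqDist(word.lower() for word in reader.words())


# Hypothetical two-file corpus so the sketch runs without the original folder.
with tempfile.TemporaryDirectory() as corpus_root:
    samples = [("a.txt", "He said yes. He left."), ("b.txt", "She stayed.")]
    for name, text in samples:
        with open(os.path.join(corpus_root, name), "w", encoding="utf8") as handle:
            handle.write(text)

    speeches = load_corpus(corpus_root)
    file_ids = sorted(speeches.fileids())  # check which files were picked up
    words = word_frequencies(speeches)     # read everything before cleanup
    he_count = words["he"]
    he_freq = words.freq("he")

print(file_ids)
print(he_count)
print(he_freq)
```

To use it on a real folder, pass that folder's path to `load_corpus` instead of the temporary directory.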

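You haven't given us much to go on. For a start, where does the error occur? Please post the full traceback, then step through your program. Does your corpus consist of `.txt` files in the directory `CRspeeches`? After initializing `speeches`, do you get a list of files from `print(speeches.fileids())`? Can you print some of the sentences that `speeches.sents()` should return?

Thank you!! Perfect solution. Do I have to load the corpus like this every time I use it, or can I now just write `import speeches` at the top of an NLTK script?

Nice, mixedmath! But this is not a bug: a regexp that starts with `*` is malformed. (The error message could be more informative, though.)

Let's clarify: `*.txt`, which the OP tried, is a glob that matches all files with the extension `.txt`. But NLTK's corpus readers don't accept globs; they accept full regular expressions. @mixedmath's solution converts @Jolijt's glob into the equivalent regexp, `.*\.txt`.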
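The difference between the glob and the regexp can be reproduced with the standard library alone, without NLTK. This is a small illustrative sketch; the file names in `names` are made up, and `fnmatch.translate` is shown only as one way to turn a glob into a regex string, not as something NLTK does internally.

```python
import fnmatch
import re

glob_pattern = '*.txt'

# Compiling the glob as a regex fails: '*' has nothing before it to repeat.
try:
    re.compile(glob_pattern)
    glob_is_valid_regex = True
except re.error as exc:
    glob_is_valid_regex = False
    print("re.error:", exc)

# The regex equivalent: '.*' means "any characters", '\.' is a literal dot.
regex = re.compile(r'.*\.txt')
names = ['speech1.txt', 'notes.md', 'sub/speech2.txt']  # hypothetical names
matched = [name for name in names if regex.fullmatch(name)]
print(matched)

# fnmatch.translate() converts a shell-style glob into a regex string.
translated = re.compile(fnmatch.translate(glob_pattern))
also_matched = [name for name in names if translated.match(name)]
print(also_matched)
```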