Python: trouble with the sentence tokenizer, word tokenizer, and part-of-speech tagger

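I am trying to read a text file into Python and then run a sentence tokenizer, a word tokenizer, and a part-of-speech tagger on it.

Here is my code: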

file=open('C:/temp/1.txt','r')
sentences = nltk.sent_tokenize(file)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]

When I run the second command, it raises this error:

Traceback (most recent call last):
File "<pyshell#26>", line 1, in <module>
sentences = nltk.sent_tokenize(file)
File "D:\Python\lib\site-packages\nltk\tokenize\__init__.py", line 76, in sent_tokenize
return tokenizer.tokenize(text)
File "D:\Python\lib\site-packages\nltk\tokenize\punkt.py", line 1217, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "D:\Python\lib\site-packages\nltk\tokenize\punkt.py", line 1262, in sentences_from_text
sents = [text[sl] for sl in self._slices_from_text(text)]
File "D:\Python\lib\site-packages\nltk\tokenize\punkt.py", line 1269, in _slices_from_text
for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer

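Another attempt: when I try a single sentence, such as "A yellow dog barked at the cat", the first three commands work, but on the last line I get this error (I wonder whether I failed to download the packages completely?):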

Traceback (most recent call last):
File "", line 1, in <module>
sentences = [nltk.pos_tag(sent) for sent in sentences]
File "D:\Python\lib\site-packages\nltk\tag\__init__.py", line 99, in pos_tag
tagger = load(_POS_TAGGER)
File "D:\Python\lib\site-packages\nltk\data.py", line 605, in load
resource_val = pickle.load(_open(resource_url))
ImportError: No module named numpy.core.multiarray

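Hmm... are you sure the error is on the second line?

You seem to be using typographic single-quote and comma characters instead of the standard ASCII ' and , characters: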

file=open(‘C:/temp/1.txt’，‘r’) # your version (WRONG)
file=open('C:/temp/1.txt', 'r') # right

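Python should not even be able to compile this. In fact, when I try it, it fails with a SyntaxError.

Update: you have now posted a syntactically correct version. The error message in the traceback is fairly straightforward: the function you are calling expects a block of text as its argument, not a file object. I know nothing about NLTK, but five seconds on Google was enough to confirm this.

Try something like this: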

file = open('C:/temp/1.txt','r')
text = file.read() # read the contents of the text file into a variable
result1 = nltk.sent_tokenize(text)
result2 = [nltk.word_tokenize(sent) for sent in result1]
result3 = [nltk.pos_tag(sent) for sent in result2]

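Update: I renamed sentences to result1/2/3, because repeatedly overwriting the same variable was causing confusion about what the code actually does. This does not change the semantics; it only makes clear that the second line really does affect the final result3.
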
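
The traceback in the question ends inside a regular-expression call, finditer(text), which is where the file object is rejected. The sketch below reproduces that failure using only the standard library, assuming Python 3 (where the wording of the TypeError differs slightly from the Python 2 message in the question). SENTENCE_END is a hypothetical stand-in pattern, not the one NLTK actually uses, and the sample file is created only for illustration.

```python
import os
import re
import tempfile

# Hypothetical stand-in for the tokenizer's internal regex; not NLTK's real pattern.
SENTENCE_END = re.compile(r"[.!?]")


def count_sentence_ends(text):
    """Count sentence-ending punctuation marks in a string."""
    return sum(1 for _ in SENTENCE_END.finditer(text))


def demo():
    """Return (error raised for a file object, count obtained from a string)."""
    # Write a small sample file so the sketch is self-contained.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("A yellow dog barked at the cat. The cat ran away.")
        path = tmp.name
    try:
        with open(path, "r") as handle:
            try:
                count_sentence_ends(handle)  # a file object is not a string
                file_object_error = None
            except TypeError as exc:
                file_object_error = exc
            handle.seek(0)
            text = handle.read()  # read() returns the contents as a string
        return file_object_error, count_sentence_ends(text)
    finally:
        os.remove(path)
```

Calling demo() returns the TypeError raised for the file object together with the count obtained once the contents have been read into a string, which is the same distinction as passing file versus file.read() to nltk.sent_tokenize.
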
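First open the file and read it:

filename = 'C:/temp/1.txt'
infile = open(filename, 'r')
text = infile.read()

Then chain the NLTK tools together, like this:

from nltk import pos_tag, sent_tokenize, word_tokenize
tagged_words = [pos_tag(word_tokenize(i)) for i in sent_tokenize(text)]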
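
To make the shape of the chained result easier to see, here is a minimal sketch that wraps the same comprehension in a function and takes the three tools as arguments. The name tag_text and the toy_* stand-ins are hypothetical and exist only so the example runs without NLTK or its data files; they are far cruder than the real tokenizers and tagger.

```python
def tag_text(text, sent_tokenize, word_tokenize, pos_tag):
    """Return one list of (token, tag) pairs per sentence in text."""
    if not isinstance(text, str):
        # Catch the mistake from the question early: a file object was passed.
        raise TypeError("expected a string; call .read() on the file object first")
    return [pos_tag(word_tokenize(sentence)) for sentence in sent_tokenize(text)]


# Toy stand-ins (assumptions for illustration, not NLTK behaviour).
def toy_sent_tokenize(text):
    """Split on periods and drop empty pieces."""
    return [part.strip() for part in text.split(".") if part.strip()]


def toy_word_tokenize(sentence):
    """Split on whitespace."""
    return sentence.split()


def toy_pos_tag(tokens):
    """Tag every token with the placeholder tag 'UNK'."""
    return [(token, "UNK") for token in tokens]


tagged = tag_text("A dog barked. The cat ran.",
                  toy_sent_tokenize, toy_word_tokenize, toy_pos_tag)
```

With NLTK installed and its tokenizer and tagger data downloaded, the real functions can be passed instead: tag_text(text, nltk.sent_tokenize, nltk.word_tokenize, nltk.pos_tag). Either way the result is a list with one entry per sentence, each entry being a list of (token, tag) pairs.
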

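Comments:

Sorry for the typo. In my code I use ' rather than ‘. But I still get this TypeError.

So are you still getting exactly the same error (indicating that the problem is on the second line)?

Only the first line changes, to: File "", line 1, in. The rest of the error is the same. I made another attempt and edited my question.

Dan, with your guidance I successfully ran the first four commands (based on your code). However, when I run the last command, whether I use a single sentence or a text document, it keeps giving the error I described in my edited question?

Add a print sentences statement after each line that starts with sentences = .... That way you will see the intermediate output. Then double-check that the input to the last line matches the format it expects; see my updated answer.

To get useful advice, you need to specify very precisely what you changed in your input. Clearly, nltk.sent_tokenize only accepts string input, not file objects, and you are not clearly distinguishing between the two.

Could you post an excerpt of what your C:/temp/1.txt looks like?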
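
The ImportError in the edited question names numpy.core.multiarray, which suggests that NumPy may not be importable in that Python installation, rather than that the NLTK data is incomplete. This is an inference from the error message and is not confirmed anywhere in the thread. A minimal sketch for checking, assuming Python 3; the helper name missing_modules is hypothetical:

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules involved in the failing nltk.pos_tag call from the question.
not_found = missing_modules(["nltk", "numpy"])
if not_found:
    print("Not importable:", ", ".join(not_found))
else:
    print("nltk and numpy can both be found")
```

If numpy shows up as not importable, installing it into the same Python environment that runs NLTK would be the first thing to try.
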