Python word_tokenize TypeError: expected string or buffer
When calling word_tokenize, I get the following error:
File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1322,
in _slices_from_text for match in
self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
I have a large text file (1500.txt) from which I want to remove the stop words.
My code is as follows:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
with open('E:\\Book\\1500.txt', "r", encoding='ISO-8859-1') as File_1500:
    stop_words = set(stopwords.words("english"))
    words = word_tokenize(File_1500)
    filtered_sentence = [w for w in words if not w in stop_words]
    print(filtered_sentence)
The input to word_tokenize is a sentence string from the document stream, i.e. one element of a list of strings such as ["This is sentence 1.", "That's sentence 2!"].
File_1500 is a file object, not a string, which is why it doesn't work.
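To see the expected input type in isolation (a quick sketch; the sample sentence is made up, not from the original post), word_tokenize works on a single string:

from nltk.tokenize import word_tokenize

# word_tokenize expects a string; a single sentence works:
print(word_tokenize("This is sentence 1."))   # ['This', 'is', 'sentence', '1', '.']

# a file object is not a string, which reproduces the TypeError above:
# word_tokenize(open('E:\\Book\\1500.txt'))   # TypeError: expected string or buffer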
To get the list of sentence strings, you first have to read the file into a string object with fin.read(), then use sent_tokenize to split it into sentences (I assume your input file is not sentence-tokenized, just a raw text file).
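As a quick sketch of that intermediate step (the two-sentence string is invented for illustration), sent_tokenize turns raw text into the list of sentence strings that word_tokenize can then consume one at a time:

from nltk.tokenize import sent_tokenize

# requires the Punkt models: nltk.download('punkt')
text = "This is sentence 1. That's sentence 2!"
print(sent_tokenize(text))
# ['This is sentence 1.', "That's sentence 2!"]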
Also, it is better / more idiomatic to tokenize a file with NLTK like this:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words("english"))

with open('E:\\Book\\1500.txt', "r", encoding='ISO-8859-1') as fin:
    # read the whole file into one string, then split it into sentences
    for sent in sent_tokenize(fin.read()):
        words = word_tokenize(sent)
        filtered_sentence = [w for w in words if not w in stop_words]
        print(filtered_sentence)
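Note that fin.read() loads the whole file into memory at once. Since the question mentions a large file, a line-by-line variant (my own sketch, assuming sentences do not span line breaks in 1500.txt) keeps memory use low:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))

with open('E:\\Book\\1500.txt', "r", encoding='ISO-8859-1') as fin:
    for line in fin:                                # stream the file line by line
        words = word_tokenize(line)
        filtered_sentence = [w for w in words if w not in stop_words]
        print(filtered_sentence)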
What error are you getting? How do you know it isn't working?
Oh, it says it expects a string, but you are passing a file. Pass it File_1500.read() to give it a string.
@SaqibAlam Change words = word_tokenize(File_1500) to words = word_tokenize(File_1500.read()).
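Putting that comment's one-line fix into the asker's original code (a reconstruction, not posted by the commenter):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

with open('E:\\Book\\1500.txt', "r", encoding='ISO-8859-1') as File_1500:
    stop_words = set(stopwords.words("english"))
    words = word_tokenize(File_1500.read())   # .read() returns the file contents as a string
    filtered_sentence = [w for w in words if not w in stop_words]
    print(filtered_sentence)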
Duplicate.