Python word_tokenize TypeError: expected string or buffer
When calling word_tokenize, I get the following error:
File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1322,
in _slices_from_text for match in
self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
I have a large text file (1500.txt) from which I want to remove the stop words.
My code is as follows:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
with open('E:\\Book\\1500.txt', "r", encoding='ISO-8859-1') as File_1500:
    stop_words = set(stopwords.words("english"))
    words = word_tokenize(File_1500)
    filtered_sentence = [w for w in words if not w in stop_words]
    print(filtered_sentence)
The input to word_tokenize is a sentence string from the document stream, i.e. one element of a list of strings such as ["This is sentence 1.", "That's sentence 2!"].
File_1500 is a file object, not a string, which is why it doesn't work.
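To see the expected input type in isolation (a quick sketch; the sample sentence is made up, not from the original post), word_tokenize works on a single string:

from nltk.tokenize import word_tokenize

# word_tokenize expects a string; a single sentence works:
print(word_tokenize("This is sentence 1."))   # ['This', 'is', 'sentence', '1', '.']

# a file object is not a string, which reproduces the TypeError above:
# word_tokenize(open('E:\\Book\\1500.txt'))   # TypeError: expected string or buffer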
To get the list of sentence strings, you first have to read the file into a string object with fin.read(), then use sent_tokenize to split it into sentences (I assume your input file is not sentence-tokenized, just a raw text file).
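As a quick sketch of that intermediate step (the two-sentence string is invented for illustration), sent_tokenize turns raw text into the list of sentence strings that word_tokenize can then consume one at a time:

from nltk.tokenize import sent_tokenize

# requires the Punkt models: nltk.download('punkt')
text = "This is sentence 1. That's sentence 2!"
print(sent_tokenize(text))
# ['This is sentence 1.', "That's sentence 2!"]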
Also, it is better / more idiomatic to tokenize a file with NLTK like this:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words("english"))

with open('E:\\Book\\1500.txt', "r", encoding='ISO-8859-1') as fin:
    # read the whole file into one string, then split it into sentences
    for sent in sent_tokenize(fin.read()):
        words = word_tokenize(sent)
        filtered_sentence = [w for w in words if not w in stop_words]
        print(filtered_sentence)
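Note that fin.read() loads the whole file into memory at once. Since the question mentions a large file, a line-by-line variant (my own sketch, assuming sentences do not span line breaks in 1500.txt) keeps memory use low:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))

with open('E:\\Book\\1500.txt', "r", encoding='ISO-8859-1') as fin:
    for line in fin:                                # stream the file line by line
        words = word_tokenize(line)
        filtered_sentence = [w for w in words if w not in stop_words]
        print(filtered_sentence)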
What error are you getting? How do you know it isn't working?
Oh, it says it expects a string, but you are passing a file. Pass it File_1500.read() to give it a string.
@SaqibAlam Change words = word_tokenize(File_1500) to words = word_tokenize(File_1500.read()).
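Putting that comment's one-line fix into the asker's original code (a reconstruction, not posted by the commenter):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

with open('E:\\Book\\1500.txt', "r", encoding='ISO-8859-1') as File_1500:
    stop_words = set(stopwords.words("english"))
    words = word_tokenize(File_1500.read())   # .read() returns the file contents as a string
    filtered_sentence = [w for w in words if not w in stop_words]
    print(filtered_sentence)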
Duplicate.