Python TfidfVectorizer: ValueError: not a built-in stop list: russian
I am trying to apply TfidfVectorizer with Russian stop words:
Tfidf = sklearn.feature_extraction.text.TfidfVectorizer(stop_words='russian' )
Z = Tfidf.fit_transform(X)
and I get:
ValueError: not a built-in stop list: russian
When I use English stop words, it works fine:
Tfidf = sklearn.feature_extraction.text.TfidfVectorizer(stop_words='english' )
Z = Tfidf.fit_transform(X)
How can I fix this?
Full traceback:
<ipython-input-118-e787bf15d612> in <module>()
1 Tfidf = sklearn.feature_extraction.text.TfidfVectorizer(stop_words='russian' )
----> 2 Z = Tfidf.fit_transform(X)
C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
1303 Tf-idf-weighted document-term matrix.
1304 """
-> 1305 X = super(TfidfVectorizer, self).fit_transform(raw_documents)
1306 self._tfidf.fit(X)
1307 # X is already a transformed view of raw_documents so
C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
815
816 vocabulary, X = self._count_vocab(raw_documents,
--> 817 self.fixed_vocabulary_)
818
819 if self.binary:
C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
745 vocabulary.default_factory = vocabulary.__len__
746
--> 747 analyze = self.build_analyzer()
748 j_indices = _make_int_array()
749 indptr = _make_int_array()
C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in build_analyzer(self)
232
233 elif self.analyzer == 'word':
--> 234 stop_words = self.get_stop_words()
235 tokenize = self.build_tokenizer()
236
C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in get_stop_words(self)
215 def get_stop_words(self):
216 """Build or fetch the effective stop words list"""
--> 217 return _check_stop_list(self.stop_words)
218
219 def build_analyzer(self):
C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _check_stop_list(stop)
88 return ENGLISH_STOP_WORDS
89 elif isinstance(stop, six.string_types):
---> 90 raise ValueError("not a built-in stop list: %s" % stop)
91 elif stop is None:
92 return None
ValueError: not a built-in stop list: russian
Could you take a look at the documentation before posting? It says:
stop_words : string {'english'}, list, or None (default)
    If a string, it is passed to _check_stop_list and the appropriate stop list is returned. 'english' is currently the only supported string value.
    If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.
    If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
There are several libraries for removing stop words that support more languages, such as or