Python TfidfVectorizer - TypeError: expected string or bytes-like object
I am trying to fit a TfidfVectorizer object to a list of video game reviews, but for some reason I am getting an error. Here is my code:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_features = 50000, use_idf = True, ngram_range=(1,3),
preprocessor = data_preprocessor.preprocess_tokenized_review)
print(train_set_x[0])
%time tfidf_matrix = tfidf_vectorizer.fit_transform(train_set_x)
Here is the error message:
I haven't gotten around to playing the campaign but the multiplayer is solid and pretty fun. Includes Zero Dark Thirty pack, an Online Pass, and the all powerful Battlefield 4 Beta access.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<timed exec> in <module>()
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
1379 Tf-idf-weighted document-term matrix.
1380 """
-> 1381 X = super(TfidfVectorizer, self).fit_transform(raw_documents)
1382 self._tfidf.fit(X)
1383 # X is already a transformed view of raw_documents so
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
867
868 vocabulary, X = self._count_vocab(raw_documents,
--> 869 self.fixed_vocabulary_)
870
871 if self.binary:
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
790 for doc in raw_documents:
791 feature_counter = {}
--> 792 for feature in analyze(doc):
793 try:
794 feature_idx = vocabulary[feature]
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in <lambda>(doc)
264
265 return lambda doc: self._word_ngrams(
--> 266 tokenize(preprocess(self.decode(doc))), stop_words)
267
268 else:
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in <lambda>(doc)
239 return self.tokenizer
240 token_pattern = re.compile(self.token_pattern)
--> 241 return lambda doc: token_pattern.findall(doc)
242
243 def get_stop_words(self):
TypeError: expected string or bytes-like object
Note that the first part of the output is one of the reviews in my video game dataset. If anyone knows what is going on, I would really appreciate your help. Thanks in advance!

I think this problem is caused by the data_preprocessor.preprocess_tokenized_review function (which you did not share).

Proof (using the default preprocessor=None):

In [19]: from sklearn.feature_extraction.text import TfidfVectorizer
In [20]: X = ["I haven't gotten around to playing the campaign but the multiplayer is solid and pretty fun. Includes Zero Dark Thirty pack, an Online Pass, and the all powerful Battlefield 4 Beta access."]
In [21]: tfidf_vectorizer = TfidfVectorizer(max_features=50000, use_idf=True, ngram_range=(1,3))
In [22]: r = tfidf_vectorizer.fit_transform(X)
In [25]: r
Out[25]:
<1x84 sparse matrix of type '<class 'numpy.float64'>'
with 84 stored elements in Compressed Sparse Row format>

So it works fine when we don't pass any value for the preprocessor parameter.

I will look into it. But if I don't pass a preprocessor, my data gets really messy; I need to remove all the numbers, punctuation, etc. Thanks for your help, by the way.
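
The traceback ends in token_pattern.findall(doc), so one plausible cause is that the custom preprocessor returns something other than a string (for example a list of tokens, or None). The sketch below reproduces that failure using only the re module and the default token pattern, then shows a preprocessor that returns a single cleaned string. Both tokens_preprocessor and clean_review are hypothetical names made up for illustration; the real preprocess_tokenized_review was not shared, so its behaviour is an assumption here.

```python
import re
import string

# Default token_pattern of TfidfVectorizer when no custom tokenizer is given.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def tokens_preprocessor(review):
    """Hypothetical broken preprocessor: returns a list, not a string."""
    return review.lower().split()


def clean_review(review):
    """Hypothetical fixed preprocessor: returns one cleaned string."""
    text = review.lower()                  # a custom preprocessor must lowercase itself
    text = re.sub(r"\d+", " ", text)       # drop digits
    text = text.translate(PUNCT_TO_SPACE)  # drop punctuation
    return " ".join(text.split())          # collapse repeated whitespace


review = ("Includes Zero Dark Thirty pack, an Online Pass, "
          "and the all powerful Battlefield 4 Beta access.")

# Passing a list to findall raises the same TypeError as in the question.
error_message = None
try:
    TOKEN_PATTERN.findall(tokens_preprocessor(review))
except TypeError as exc:
    error_message = str(exc)
    print("list input fails:", error_message)

# A string-returning preprocessor tokenizes without error.
cleaned = clean_review(review)
tokens = TOKEN_PATTERN.findall(cleaned)
print(cleaned)
print(tokens)
```

If the real function does return tokens, joining them back into one string before returning should be enough. A function like clean_review could then be passed as TfidfVectorizer(preprocessor=clean_review, ...). Note that a custom preprocessor replaces the vectorizer's built-in lowercasing and accent stripping, which is why the sketch lowercases the text itself.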