Python TfidfVectorizer - TypeError: expected string or bytes-like object


I'm trying to fit a TfidfVectorizer object on a list of video game reviews, but for some reason I'm getting an error.

Here's my code:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features = 50000, use_idf = True, ngram_range=(1,3),
                                   preprocessor = data_preprocessor.preprocess_tokenized_review)

print(train_set_x[0])
%time tfidf_matrix = tfidf_vectorizer.fit_transform(train_set_x)
Here's the error message:

I haven't gotten around to playing the campaign but the multiplayer is solid and pretty fun. Includes Zero Dark Thirty pack, an Online Pass, and the all powerful Battlefield 4 Beta access.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<timed exec> in <module>()

~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
   1379             Tf-idf-weighted document-term matrix.
   1380         """
-> 1381         X = super(TfidfVectorizer, self).fit_transform(raw_documents)
   1382         self._tfidf.fit(X)
   1383         # X is already a transformed view of raw_documents so

~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
    867 
    868         vocabulary, X = self._count_vocab(raw_documents,
--> 869                                           self.fixed_vocabulary_)
    870 
    871         if self.binary:

~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
    790         for doc in raw_documents:
    791             feature_counter = {}
--> 792             for feature in analyze(doc):
    793                 try:
    794                     feature_idx = vocabulary[feature]

~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in <lambda>(doc)
    264 
    265             return lambda doc: self._word_ngrams(
--> 266                 tokenize(preprocess(self.decode(doc))), stop_words)
    267 
    268         else:

~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in <lambda>(doc)
    239             return self.tokenizer
    240         token_pattern = re.compile(self.token_pattern)
--> 241         return lambda doc: token_pattern.findall(doc)
    242 
    243     def get_stop_words(self):

TypeError: expected string or bytes-like object

Note that the first part of the output is one of the reviews from my video game dataset. If anyone has an idea of what's going on, I'd really appreciate the help. Thanks in advance!

I think the problem is caused by your data_preprocessor.preprocess_tokenized_review function (which you haven't shared).

Proof (using the default preprocessor=None):

In [19]: from sklearn.feature_extraction.text import TfidfVectorizer

In [20]: X = ["I haven't gotten around to playing the campaign but the multiplayer is solid and pretty fun. Includes Zero Dark Thirty pack, an Onlin
    ...: e Pass, and the all powerful Battlefield 4 Beta access."]

In [21]: tfidf_vectorizer = TfidfVectorizer(max_features=50000, use_idf=True, ngram_range=(1,3))

In [22]: r = tfidf_vectorizer.fit_transform(X)

In [25]: r
Out[25]:
<1x84 sparse matrix of type '<class 'numpy.float64'>'
        with 84 stored elements in Compressed Sparse Row format>
So it works fine when nothing is passed for the preprocessor parameter.
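
Judging from the traceback, the failure happens in token_pattern.findall(doc), which runs on whatever the preprocessor returns. So one likely cause (we can't see preprocess_tokenized_review, so this is an assumption) is that it returns a list of tokens instead of a string. A minimal sketch of the contract, with made-up function names:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Includes Zero Dark Thirty pack and Battlefield 4 Beta access."]

def returns_a_string(doc):
    return doc.lower()             # str in, str out -> tokenizer is happy

def returns_tokens(doc):
    return doc.lower().split()     # str in, list out -> findall(list) fails

TfidfVectorizer(preprocessor=returns_a_string).fit_transform(docs)   # works
# TfidfVectorizer(preprocessor=returns_tokens).fit_transform(docs)
# -> TypeError: expected string or bytes-like object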

I'll look into it. But if I don't pass any preprocessor argument, my data ends up really messy; I need to strip out all the numbers, punctuation, and so on. Thanks for the help, by the way.
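
If the goal is just to strip digits and punctuation, the cleanup can stay inside the preprocessor as long as it hands back a single string so the default tokenizer can still split it. A minimal, hypothetical sketch (clean_review is a made-up stand-in for preprocess_tokenized_review):

import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer

def clean_review(doc):
    doc = doc.lower()
    doc = re.sub(r"\d+", " ", doc)                                   # remove numbers
    doc = doc.translate(str.maketrans("", "", string.punctuation))   # remove punctuation
    return re.sub(r"\s+", " ", doc).strip()                          # tidy whitespace

tfidf_vectorizer = TfidfVectorizer(max_features=50000, use_idf=True,
                                   ngram_range=(1, 3),
                                   preprocessor=clean_review)
# tfidf_matrix = tfidf_vectorizer.fit_transform(train_set_x)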