Python: pipeline with count and tfidf vectorizers raises TypeError: expected string or bytes-like object
I have a corpus like the following: "C C 0 0 X 0 1 0 0 0", "C C 0 0 X 0 1 0 0 0 0", "C C 0 0 X 0 1 0 0 0 0", "X X X X". I want to use count and tfidf vectorizers together with logistic regression as the classifier. The code below is adapted from an sklearn example:
from pprint import pprint
from time import time
import logging
import pickle
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
print(__doc__)
# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')
# #############################################################################
# Define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer='char',lowercase=False)),
    ('tfidf', TfidfVectorizer(analyzer='char',lowercase=False)),
    ('clf', LogisticRegression()),
])
# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    # 'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)), # unigrams or bigrams
    # 'tfidf__use_idf': (True, False),
    # 'tfidf__norm': ('l1', 'l2'),
    'clf__max_iter': (1000,),
    'clf__C': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    # 'clf__max_iter': (10, 50, 80),
}
if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block
    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
    corpus =['C C C 0 0 0 X 0 1 0 0 0 0', 'C C C 0 0 0 X 0 1 0 0 0 0', 'C C C 0 0 0 X 0 1 0 0 0 0', 'X X X', 'X X X',
'X X X', 'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
'X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 0',
'X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 0',
'X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 0',
'X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 0',
'X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 0',
'X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 0',
'X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 0',
'X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 0',
'X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 0',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X']
    y_train = [0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
    print(len(corpus),len(y_train))
    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    #print(type(data.data),type(data.target))
    #print(data.data[:1])
    #print(data.data[:2])
    grid_search.fit(corpus,y_train)
    print("done in %0.3fs" % (time() - t0))
    print()
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
My stack trace is as follows:
Automatically created module for IPython interactive environment
50 50
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__C': (1e-05, 1e-06),
'clf__max_iter': (1000,),
'clf__penalty': ('l2', 'elasticnet'),
'vect__max_df': (0.5, 0.75, 1.0),
'vect__ngram_range': ((1, 1), (1, 2))}
Fitting 5 folds for each of 24 candidates, totalling 120 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed: 0.1s finished
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-114-0d47590b1279> in <module>
107 #print(data.data[:2])
108
--> 109 grid_search.fit(corpus,y_train)
110 print("done in %0.3fs" % (time() - t0))
111 print()
E:\anaconda\envs\appliedaicourse\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
737 refit_start_time = time.time()
738 if y is not None:
--> 739 self.best_estimator_.fit(X, y, **fit_params)
740 else:
741 self.best_estimator_.fit(X, **fit_params)
E:\anaconda\envs\appliedaicourse\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params)
348 This estimator
349 """
--> 350 Xt, fit_params = self._fit(X, y, **fit_params)
351 with _print_elapsed_time('Pipeline',
352 self._log_message(len(self.steps) - 1)):
E:\anaconda\envs\appliedaicourse\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params)
313 message_clsname='Pipeline',
314 message=self._log_message(step_idx),
--> 315 **fit_params_steps[name])
316 # Replace the transformer of the step with the fitted
317 # transformer. This is necessary when loading the transformer
E:\anaconda\envs\appliedaicourse\lib\site-packages\joblib\memory.py in __call__(self, *args, **kwargs)
350
351 def __call__(self, *args, **kwargs):
--> 352 return self.func(*args, **kwargs)
353
354 def call_and_shelve(self, *args, **kwargs):
E:\anaconda\envs\appliedaicourse\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
726 with _print_elapsed_time(message_clsname, message):
727 if hasattr(transformer, 'fit_transform'):
--> 728 res = transformer.fit_transform(X, y, **fit_params)
729 else:
730 res = transformer.fit(X, y, **fit_params).transform(X)
E:\anaconda\envs\appliedaicourse\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
1857 """
1858 self._check_params()
-> 1859 X = super().fit_transform(raw_documents)
1860 self._tfidf.fit(X)
1861 # X is already a transformed view of raw_documents so
E:\anaconda\envs\appliedaicourse\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
1218
1219 vocabulary, X = self._count_vocab(raw_documents,
-> 1220 self.fixed_vocabulary_)
1221
1222 if self.binary:
E:\anaconda\envs\appliedaicourse\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
1129 for doc in raw_documents:
1130 feature_counter = {}
-> 1131 for feature in analyze(doc):
1132 try:
1133 feature_idx = vocabulary[feature]
E:\anaconda\envs\appliedaicourse\lib\site-packages\sklearn\feature_extraction\text.py in _analyze(doc, analyzer, tokenizer, ngrams, preprocessor, decoder, stop_words)
108 doc = ngrams(doc, stop_words)
109 else:
--> 110 doc = ngrams(doc)
111 return doc
112
E:\anaconda\envs\appliedaicourse\lib\site-packages\sklearn\feature_extraction\text.py in _char_ngrams(self, text_document)
255 """Tokenize text_document into a sequence of character n-grams"""
256 # normalize white spaces
--> 257 text_document = self._white_spaces.sub(" ", text_document)
258
259 text_len = len(text_document)
TypeError: expected string or bytes-like object
Results when the vectorizer is run on its own:
<class 'list'>
[' 0 0 0', ' 0 0 X', ' 0 1 0', ' 0 X 0', ' 0 X X', ' 1 0 0', ' C 0 0', ' C C 0', ' C C C', ' C C X', ' C X X', ' X 0 0', ' X 0 1', ' X 0 X', ' X X 0', ' X X X', '0 0 0 ', '0 0 X ', '0 1 0 ', '0 X 0 ', '1 0 0 ', 'C 0 0 ', 'C C 0 ', 'C C C ', 'C C X ', 'C X X ', 'X 0 0 ', 'X 0 1 ', 'X 0 X ', 'X X 0 ', 'X X X ']
(50, 31)
(0, 20) 0.31810783213188626
(0, 5) 0.31810783213188626
(0, 18) 0.31810783213188626
(0, 2) 0.31810783213188626
(0, 27) 0.31810783213188626
(0, 12) 0.31810783213188626
(0, 19) 0.16116825632411622
(0, 3) 0.16116825632411622
(0, 17) 0.16116825632411622
(0, 1) 0.11378963445554637
(0, 16) 0.22757926891109273
(0, 0) 0.3413689033666391
(0, 21) 0.17370780684495662
(0, 6) 0.17370780684495662
(0, 22) 0.17370780684495662
(0, 7) 0.17370780684495662
(0, 23) 0.11378963445554637
(1, 20) 0.31810783213188626
(1, 5) 0.31810783213188626
(1, 18) 0.31810783213188626
...
...
...
(49, 1) 0.01436413072356797
(49, 16) 0.01436413072356797
(49, 0) 0.01436413072356797
(49, 23) 0.6894782747312626
['0 0 0 0 0 X 0 0 X 0 0 0 X 0 0 X 0 0 X 0 0 X 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 C 0 0 C C X X X X X X 0 0 0 0 X 0 1 X 0 X X 0 X 0 0 0 0 X 0 0 X 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 C C X X X X X 0 0 0 0 0 0 0 0 0 0 X X X X X X 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X X X 0 0 0 0 0 0 X 0 X 0 X 0 X 0 X 0 X 0 X 0
(50, 31)
(0, 20) 0.31810783213188626
(0, 5) 0.31810783213188626
(0, 18) 0.31810783213188626
(0, 2) 0.31810783213188626
(0, 27) 0.31810783213188626
(0, 12) 0.31810783213188626
(0, 19) 0.16116825632411622
(0, 3) 0.16116825632411622
(0, 17) 0.16116825632411622
(0, 1) 0.11378963445554637
(0, 16) 0.22757926891109273
(0, 0) 0.3413689033666391
(0, 21) 0.17370780684495662
(0, 6) 0.17370780684495662
(0, 22) 0.17370780684495662
(0, 7) 0.17370780684495662
(0, 23) 0.11378963445554637
(1, 20) 0.31810783213188626
(1, 5) 0.31810783213188626
(1, 18) 0.31810783213188626
...
...
...
(49, 1) 0.01436413072356797
(49, 16) 0.01436413072356797
(49, 0) 0.01436413072356797
(49, 23) 0.6894782747312626
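My question
Why does the standalone vectorizer work, while I get the TypeError as soon as it is placed in a pipeline used by GridSearchCV?

By default, both CountVectorizer and TfidfVectorizer expect a sequence of items of type string or bytes. In your pipeline, CountVectorizer receives the corpus and passes a sparse representation of the counts (a scipy.sparse.csr_matrix) on to TfidfVectorizer. Because the input to TfidfVectorizer is not of the expected type, you get "TypeError: expected string or bytes-like object". The pipeline works if you use one of the two vectorizers rather than both. For example,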
pipeline = Pipeline([
    #('vect', CountVectorizer(analyzer='char',lowercase=False)),
    ('tfidf', TfidfVectorizer(analyzer='char',lowercase=False)),
    ('clf', LogisticRegression())
])
# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    #'vect__max_df': (0.5, 0.75, 1.0),
    # 'vect__max_features': (None, 5000, 10000, 50000),
    #'vect__ngram_range': [(1, 1), (1, 2)], # unigrams or bigrams
    'tfidf__use_idf': [True, False],
    'tfidf__norm': ['l1', 'l2'],
    'clf__max_iter': [1000],
    'clf__C': [0.00001, 0.000001],
    'clf__penalty': ['l2'],
    # 'clf__max_iter': (10, 50, 80),
}
This produces the following output:
50 50
Performing grid search...
pipeline: ['tfidf', 'clf']
parameters:
{'clf__C': [1e-05, 1e-06],
'clf__max_iter': [1000],
'clf__penalty': ['l2'],
'tfidf__norm': ['l1', 'l2'],
'tfidf__use_idf': [True, False]}
Fitting 5 folds for each of 8 candidates, totalling 40 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
done in 0.347s
Best score: 0.680
Best parameters set:
clf__C: 1e-05
clf__max_iter: 1000
clf__penalty: 'l2'
tfidf__norm: 'l1'
tfidf__use_idf: True
[Parallel(n_jobs=-1)]: Done 40 out of 40 | elapsed: 0.2s finished
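If you would rather keep a separate counting step, one option that is not shown in the answer above is to follow CountVectorizer with TfidfTransformer, which takes a count matrix as input instead of raw strings. The sketch below first reproduces the type mismatch and then fits such a pipeline. The small corpus `docs` and the labels `labels` are made up for illustration and are not the data from the question, so treat this as a sketch under those assumptions rather than a tuned setup.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-in corpus and labels (not the data from the question).
docs = ["C C 0 0 X 0 1 0", "X X X X", "C C C 0 0 X", "X X X 0"] * 5
labels = [0, 1, 0, 1] * 5

# Step 1: CountVectorizer turns the strings into a sparse count matrix.
counts = CountVectorizer(analyzer="char", lowercase=False).fit_transform(docs)
print(type(counts).__name__, issparse(counts))

# Step 2: a second text vectorizer cannot consume that matrix, because it
# iterates over its input and expects every item to be a str or bytes.
try:
    TfidfVectorizer(analyzer="char", lowercase=False).fit_transform(counts)
    second_vectorizer_failed = False
except TypeError as exc:
    second_vectorizer_failed = True
    print("TypeError:", exc)

# Step 3: TfidfTransformer accepts a count matrix, so it can follow
# CountVectorizer inside a pipeline.
pipeline = Pipeline([
    ("vect", CountVectorizer(analyzer="char", lowercase=False)),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(docs, labels)
predictions = pipeline.predict(docs)
print(len(predictions))
```

With this layout the `vect__...` grid keys keep tuning the counting step, while `tfidf__use_idf` and `tfidf__norm` apply to the TfidfTransformer step.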
Great, great.