Python: Using spaCy as a tokenizer in an sklearn pipeline

I'm trying to use spaCy as the tokenizer in a larger scikit-learn pipeline, but I keep hitting a problem where the task cannot be pickled to be sent to the workers. Minimal example:
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import fetch_20newsgroups
from functools import partial
import spacy

def spacy_tokenize(text, nlp):
    return [x.orth_ for x in nlp(text)]

nlp = spacy.load('en', disable=['ner', 'parser', 'tagger'])
tok = partial(spacy_tokenize, nlp=nlp)

pipeline = Pipeline([('vectorize', CountVectorizer(tokenizer=tok)),
                     ('clf', SGDClassifier())])
params = {'vectorize__ngram_range': [(1, 2), (1, 3)]}
CV = RandomizedSearchCV(pipeline,
                        param_distributions=params,
                        n_iter=2, cv=2, n_jobs=2,
                        scoring='accuracy')
categories = ['alt.atheism', 'comp.graphics']
news = fetch_20newsgroups(subset='train',
                          categories=categories,
                          shuffle=True,
                          random_state=42)
CV.fit(news.data, news.target)
Running this code produces the following error:
PicklingError: Could not pickle the task to send it to the workers.
What confuses me is that:
import pickle
pickle.dump(tok, open('test.pkl', 'wb'))
works without any problem.
Does anyone know whether spaCy can be used together with sklearn cross-validation?

Thanks!

This isn't a solution, but a workaround. There seem to be some issues between spaCy and joblib. The workaround is to define the tokenizer in its own module and import it into your main script:
- custom_file.py

import spacy

nlp = spacy.load('en', disable=['ner', 'parser', 'tagger'])

def spacy_tokenizer(doc):
    return [x.orth_ for x in nlp(doc)]
- main.py

# Other code ...
from custom_file import spacy_tokenizer

pipeline = Pipeline([('vectorize', CountVectorizer(tokenizer=spacy_tokenizer)),
                     ('clf', SGDClassifier())])
# ...
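Another workaround (not from the original answer, just a sketch) is to wrap the tokenizer in a small class that loads the spaCy model lazily and excludes it from its pickled state. That way the object sent to the workers contains no spaCy internals, and each worker process loads its own model on first use. The class name `SpacyTokenizer` and the model name `en_core_web_sm` are illustrative assumptions:

```python
class SpacyTokenizer:
    """Picklable tokenizer: the spaCy model is loaded lazily on first call
    and dropped from the pickled state, so joblib workers never have to
    serialize spaCy internals."""

    def __init__(self, model='en_core_web_sm'):  # model name is an assumption
        self.model = model
        self._nlp = None

    def __call__(self, text):
        if self._nlp is None:
            import spacy  # imported lazily, inside the worker process
            self._nlp = spacy.load(self.model,
                                   disable=['ner', 'parser', 'tagger'])
        return [t.orth_ for t in self._nlp(text)]

    def __getstate__(self):
        # Pickle only the model name, never the loaded pipeline
        return {'model': self.model, '_nlp': None}
```

You would then pass `CountVectorizer(tokenizer=SpacyTokenizer())` in the pipeline; the instance pickles cleanly because its state is just a string.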