Python: Adding stemming support to CountVectorizer (sklearn)


I'm trying to add stemming to my NLP pipeline using sklearn.

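from nltk.corpus import stopwords
from nltk.stem.snowball import FrenchStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC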

stop = stopwords.words('french')
stemmer = FrenchStemmer()


class StemmedCountVectorizer(CountVectorizer):
    def __init__(self, stemmer):
        super(StemmedCountVectorizer, self).__init__()
        self.stemmer = stemmer

    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc:(self.stemmer.stem(w) for w in analyzer(doc))

stem_vectorizer = StemmedCountVectorizer(stemmer)
text_clf = Pipeline([('vect', stem_vectorizer), ('tfidf', TfidfTransformer()), ('clf', SVC(kernel='linear', C=1)) ])

When I use this pipeline with sklearn's plain CountVectorizer, it works. It also works if I build the features manually, like this:

vectorizer = StemmedCountVectorizer(stemmer)
X_counts = vectorizer.fit_transform(X)
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)

Edit:

If I try this pipeline in my IPython notebook, the cell shows [*] but nothing happens. When I look at the terminal, it shows the following error:

Process PoolWorker-12:
Traceback (most recent call last):
  File "C:\Anaconda2\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Anaconda2\lib\multiprocessing\process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Anaconda2\lib\multiprocessing\pool.py", line 102, in worker
    task = get()
  File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\pool.py", line 360, in get
    return recv()
AttributeError: 'module' object has no attribute 'StemmedCountVectorizer'

Example

Here is a complete example:

from sklearn.pipeline import Pipeline
from sklearn import grid_search  # scikit-learn >= 0.20: use sklearn.model_selection instead
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from nltk.stem.snowball import FrenchStemmer

stemmer = FrenchStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemming(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

X = ['le chat est beau', 'le ciel est nuageux', 'les gens sont gentils', 'Paris est magique', 'Marseille est tragique', 'JCVD est fou']
Y = [1,0,1,1,0,0]

text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', SVC())])
parameters = { 'vect__analyzer': ['word', stemming]}

gs_clf = grid_search.GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf.fit(X, Y)

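If I remove stemming from the parameters, it works; otherwise it does not.

Update:

The problem seems to come from parallelization: when I remove n_jobs=-1, the problem disappears.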
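
The parallelization symptom is consistent with how Python's multiprocessing hands work to worker processes: objects are pickled, and functions and classes are pickled by reference (module name plus attribute name), not by value. The following standard-library-only sketch illustrates this. `is_picklable` is a hypothetical helper, and `json.dumps` merely stands in for any function that lives in an importable module.

```python
import json
import pickle


def is_picklable(obj):
    """Return True if pickle.dumps accepts obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A function from an importable module is stored as "module + name" only,
# so the receiving process must be able to import that module and find the name.
payload = pickle.dumps(json.dumps)
print(b"json" in payload, b"dumps" in payload)
print(pickle.loads(payload) is json.dumps)

# A lambda has no importable name, so it cannot be pickled by reference.
stem_lambda = lambda doc: [w.lower() for w in doc.split()]
print(is_picklable(stem_lambda))
```

Under this reading, a class or function defined only in a notebook session can be looked up in the parent process but not in a freshly started worker, which would match the AttributeError in the traceback above.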

You can try:

def build_analyzer(self):
    analyzer = super(StemmedCountVectorizer, self).build_analyzer()
    return lambda doc:(stemmer.stem(w) for w in analyzer(doc))

and remove the __init__ method.

You can pass a callable as the analyzer to the CountVectorizer constructor to provide a custom analyzer. This seems to work for me:

from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import FrenchStemmer

stemmer = FrenchStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

stem_vectorizer = CountVectorizer(analyzer=stemmed_words)
print(stem_vectorizer.fit_transform(['Tu marches dans la rue']))
print(stem_vectorizer.get_feature_names())

This prints:

  (0, 4)    1
  (0, 2)    1
  (0, 0)    1
  (0, 1)    1
  (0, 3)    1
[u'dan', u'la', u'march', u'ru', u'tu']
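
A variation on the callable-analyzer idea, for cases where the stemmer should stay configurable rather than being a module-level global, is a small callable class. This is only a sketch: `StemmingAnalyzer`, `toy_stem`, and `toy_analyzer` are hypothetical names that do not come from scikit-learn or NLTK, and the two toy functions stand in for `FrenchStemmer().stem` and `CountVectorizer().build_analyzer()`.

```python
class StemmingAnalyzer:
    """Callable analyzer that stems the tokens produced by a base analyzer."""

    def __init__(self, stem, base_analyzer):
        self.stem = stem                    # assumed: something like FrenchStemmer().stem
        self.base_analyzer = base_analyzer  # assumed: something like CountVectorizer().build_analyzer()

    def __call__(self, doc):
        # Return a list so the result can be iterated more than once.
        return [self.stem(w) for w in self.base_analyzer(doc)]


def toy_stem(word):
    """Hypothetical stand-in stemmer: strips one trailing 's' from longer words."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def toy_analyzer(doc):
    """Hypothetical stand-in analyzer: lowercase and split on whitespace."""
    return doc.lower().split()


analyzer = StemmingAnalyzer(toy_stem, toy_analyzer)
print(analyzer("Tu marches dans la rue"))
```

An instance like this can be passed as `analyzer=` in the same way as the function above. Whether it survives pickling depends on the class living in an importable module and on its attributes being picklable themselves.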

I know I'm a bit late posting my answer, but here it is in case someone still needs help.

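Below is a way to add a stemmer to CountVectorizer by overriding build_analyzer(). You can then freely call the fit and transform functions of the CountVectorizer class on your vectorizer object.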
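It doesn't work (it gives the same error), and I need the stemmer attribute.

Can you give more information about the printed error? For example, which line breaks?

I used a grid search with n_jobs=-1 to parallelize the work.

The problem may be that the lambda function is not picklable. Just replace the lambda function with a def function.

Using parameters = {'vect__analyzer': ['word', stemming]} as the grid search parameters gives the error: AttributeError: 'module' object has no attribute 'stemming'

If we override the analyzer parameter so that it is no longer the default 'word', are the tokenizer and stop_words parameters disabled, as the documentation states? In that case, do they need to be implemented inside the same analyzer function?

This looks like a pickle/unpickle scoping issue. If you put stemming in an imported module, for example, it can be unpickled more reliably.

Could you provide an example or a link to clarify what you mean? How do I put stemming in an imported module? Without parallelization, GridSearch is very slow, even with only a few parameters to tune.

I mean moving the stemming code into myutils.py and then using from myutils import stemming.

Yes, it finally works. Could you edit your answer so I can accept it? That really was my problem.

Could you first clarify how you were running the code when it failed? Were you typing it into an interactive console, IDLE, an IPython notebook, running a script, etc.? Thanks.

I ran this code and the stemmer works fine, but the custom stop words supplied through the stop_words argument no longer work. Is there a workaround?

@Ramya Yes, there is a workaround: from nltk.corpus import stopwords, then StemmedCountVectorizer(…, stop_words=stopwords.words('french'))

@ChirazBenAbdelkader That will not remove the stop words. As stated in the documentation, the stop_words parameter only applies when analyzer == 'word'.

Should the stop words be stemmed before they are passed in? I mean, are stop words filtered before or after the analyzer is applied?

I think I found it (correct me if I'm wrong): when stemming is added by overriding build_analyzer, it is applied after stop-word removal, so stemming the stop words is pointless.
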
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import nltk.stem

french_stemmer = nltk.stem.SnowballStemmer('french')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: [french_stemmer.stem(w) for w in analyzer(doc)]

# scikit-learn only ships a built-in 'english' stop list, so pass the French list explicitly
vectorizer_s = StemmedCountVectorizer(min_df=3, analyzer="word", stop_words=stopwords.words('french'))
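
The comments raise two points that a single module-level function can address: a plain def in an importable module (for example the myutils.py mentioned above) is picklable by reference, and a custom callable analyzer has to do its own stop-word filtering because stop_words is ignored in that case. The sketch below uses only the standard library. `STOP_WORDS` and `toy_stem` are hypothetical stand-ins for a real French stop list and for `FrenchStemmer().stem`, and the regular expression is the documented default token_pattern of CountVectorizer.

```python
import re

# Default token_pattern documented for CountVectorizer: words of 2+ characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

# Tiny illustrative stop list (hypothetical); a real one could come from NLTK.
STOP_WORDS = frozenset({"le", "la", "les", "est", "sont", "dans"})


def toy_stem(word):
    """Hypothetical stand-in stemmer: strips one trailing 's' from longer words."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def stemming(doc):
    """Lowercase, tokenize, drop stop words, then stem the remaining tokens."""
    tokens = TOKEN_RE.findall(doc.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


print(stemming("Les gens sont gentils"))
print(stemming("Tu marches dans la rue"))
```

Saved in a module and imported with from myutils import stemming, a function of this shape could be passed as analyzer=stemming or listed in the grid search parameters. Because stop words are removed before stemming here, the stop list should contain unstemmed forms.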