Python: Can you add to a CountVectorizer in scikit-learn?


python nlp scikit-learn


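I'd like to create a CountVectorizer in scikit-learn from a corpus of text and then later add more text to it (adding to the original vocabulary). If I use transform(), it keeps the original vocabulary but does not add new words. If I use fit_transform(), it just regenerates the vocabulary from scratch. See below: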

In [2]: count_vect = CountVectorizer()

In [3]: count_vect.fit_transform(["This is a test"])
Out[3]: 
<1x3 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>

In [4]: count_vect.vocabulary_  
Out[4]: {u'is': 0, u'test': 1, u'this': 2}

In [5]: count_vect.transform(["This not is a test"])
Out[5]: 
<1x3 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>

In [6]: count_vect.vocabulary_
Out[6]: {u'is': 0, u'test': 1, u'this': 2}

In [7]: count_vect.fit_transform(["This not is a test"])
Out[7]: 
<1x4 sparse matrix of type '<type 'numpy.int64'>'
    with 4 stored elements in Compressed Sparse Row format>

In [8]: count_vect.vocabulary_
Out[8]: {u'is': 0, u'not': 1, u'test': 2, u'this': 3}

Is there any way to do this?

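The algorithms implemented in scikit-learn are designed to be fit on all of the data at once, which is necessary for most ML algorithms (though, interestingly, not for the application you describe), so there is no update functionality.

There is a way to get what you want by thinking about it slightly differently, though: keep the original documents and refit on the old and new text together. See the following code: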
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
# Fit on the initial corpus.
count_vect.fit_transform(["This is a test"])
print(count_vect.vocabulary_)
# Refit on the full corpus (old and new documents) to rebuild the vocabulary.
count_vect.fit_transform(["This is a test", "This is not a test"])
print(count_vect.vocabulary_)

Which outputs (shown as printed by Python 2; Python 3 omits the u prefixes):
{u'this': 2, u'test': 1, u'is': 0}
{u'this': 3, u'test': 2, u'is': 0, u'not': 1}

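
Note that refitting reassigns column indices (above, 'test' moves from 1 to 2). If the old indices need to stay stable, one possible workaround is to build a second vectorizer with a fixed vocabulary that appends unseen tokens after the existing ones. The sketch below is only an illustration under that assumption: extend_vectorizer is a hypothetical helper, not part of scikit-learn, and it relies on the documented vocabulary constructor parameter plus build_analyzer() and get_params().

```python
from sklearn.feature_extraction.text import CountVectorizer


def extend_vectorizer(fitted, new_docs):
    """Hypothetical helper (not a scikit-learn API).

    Returns a new CountVectorizer whose vocabulary is the fitted one plus
    any unseen tokens from new_docs. Existing terms keep their column
    indices; new terms are appended after them.
    """
    analyzer = fitted.build_analyzer()   # same tokenization as the original
    vocab = dict(fitted.vocabulary_)     # copy, so the original is untouched
    for doc in new_docs:
        for token in analyzer(doc):
            if token not in vocab:
                vocab[token] = len(vocab)  # next free column index
    # Reuse the original settings, but pin the vocabulary.
    params = fitted.get_params()
    params["vocabulary"] = vocab
    return CountVectorizer(**params)


count_vect = CountVectorizer()
count_vect.fit(["This is a test"])

extended = extend_vectorizer(count_vect, ["This is not a test"])
# A vectorizer with a fixed vocabulary can transform without being fit.
counts = extended.transform(["This is not a test"])
print(extended.vocabulary_)
print(counts.toarray())
```

Matrices produced by the original vectorizer have fewer columns than those from the extended one, so earlier documents would need to be re-transformed (or padded with zero columns) before being combined with the new output.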