scikit-learn CountVectorizer raises an empty vocabulary error when the document is a cardinal number


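I ran into a problem when using sklearn's CountVectorizer on documents that contain the word "one". I worked out that the error occurs when the documents contain only words with the POS tag CD (cardinal number). The following documents all cause the empty vocabulary error: ['one', 'two'] ['one hundred'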

from sklearn.feature_extraction.text import CountVectorizer

ngram_code = 1
cv = CountVectorizer(stop_words='english', analyzer='word', lowercase=True,
                     token_pattern=r"[\w']+", ngram_range=(ngram_code, ngram_code))
cv_array = cv.fit_transform(['one', 'two'])  # raises the ValueError shown below

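This raises the error: ValueError: empty vocabulary; perhaps the documents only contain stop words

The following does not cause the error because (I think) the cardinal words are mixed with other words: ['one', 'two', 'people']

Interestingly, in this case only 'people' is added to the vocabulary, while 'one' and 'two' are not: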

cv_array = cv.fit_transform(['one', 'two', 'people'])
cv.vocabulary_
Out[143]: {'people': 0}

As another example, the single-word document ['hello'] works fine because it is not a cardinal number:

cv_array = cv.fit_transform(['hello'])
cv.vocabulary_
Out[147]: {'hello': 0}

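Since words like 'one' and 'two' are not stop words, I would expect CountVectorizer to process them. How can I handle these words?

In addition, I get the same error for the word 'system'. Why does this word cause the error?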

cv_array = cv.fit_transform(['system'])

ValueError: empty vocabulary; perhaps the documents only contain stop words

You get an empty vocabulary because these words are in the stop word list that sklearn uses. You can inspect the list or test membership as follows:

>>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

>>> 'one' in ENGLISH_STOP_WORDS 
True

>>> 'two' in ENGLISH_STOP_WORDS 
True

>>> 'system' in ENGLISH_STOP_WORDS 
True
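To see the effect on whole documents rather than single words, one option is to call the analyzer that CountVectorizer builds internally. This is only a sketch: the sample strings are made up for illustration, and the vectorizer settings are a subset of those in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same stop word setting and token pattern as in the question.
cv = CountVectorizer(stop_words='english', token_pattern=r"[\w']+")

# build_analyzer() returns the callable that lowercases, tokenizes,
# and removes stop words before any counting happens.
analyze = cv.build_analyzer()

tokens_all_stop = analyze('one two system')  # every token is a stop word
tokens_mixed = analyze('one two people')     # only 'people' survives

print(tokens_all_stop)
print(tokens_mixed)
```

When every document analyzes to an empty token list, nothing is left to build a vocabulary from, which is when fit_transform raises the ValueError.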

If you want to process these words, just initialize CountVectorizer as follows:

cv = CountVectorizer(stop_words=None, ...
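As a hedged illustration of that fix, the sketch below assumes the elided arguments are the ones from the question. It also shows a second possibility that the answer does not mention: keeping sklearn's English list but removing a few words from it. The names `keep` and `custom_stop_words` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Option 1: disable stop word filtering entirely.
cv = CountVectorizer(stop_words=None, analyzer='word', lowercase=True,
                     token_pattern=r"[\w']+", ngram_range=(1, 1))
cv.fit(['one', 'two', 'system'])
vocab_no_filter = sorted(cv.vocabulary_)
print(vocab_no_filter)  # ['one', 'system', 'two']

# Option 2 (an assumption, not from the answer): keep the built-in list
# but let selected words through. stop_words is passed as a plain list.
keep = {'one', 'two', 'system'}
custom_stop_words = list(ENGLISH_STOP_WORDS - keep)
cv_custom = CountVectorizer(stop_words=custom_stop_words, analyzer='word',
                            lowercase=True, token_pattern=r"[\w']+")
cv_custom.fit(['one', 'two', 'the', 'system'])
vocab_custom = sorted(cv_custom.vocabulary_)
print(vocab_custom)  # ['one', 'system', 'two']  ('the' is still filtered)
```

The second variant may be preferable when common function words should still be dropped but number words matter for the task.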

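Thank you very much for your answer. It cleared things up for me. Of course, CountVectorizer uses sklearn's stop words. I was mistakenly looking at the nltk stop words, which I use separately for tokenization before calling CountVectorizer. I was doing: from nltk.corpus import stopwords; stop_words = set(stopwords.words('english')); 'one' in stop_words gave Out[7]: False, which is why I was confused. It is interesting, though, that the nltk stop words and the sklearn stop words are different! Thanks a lot.

Glad I could help. You were really unlucky: almost all the words you tested are in sklearn's stop word set.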