Python sklearn CountVectorizer outputs a matrix with empty rows


python numpy scikit-learn


I am using CountVectorizer to generate a vector for each document. In my case, the documents are short texts of 1-5 words.

from sklearn.feature_extraction.text import CountVectorizer

# `documents` is the list of short input texts
corpus = []
for doc in documents:
    if doc:  # make sure there is no empty document
        corpus.append(doc)

countVectorizer = CountVectorizer()
weight_arr = countVectorizer.fit_transform(corpus)

for doc_index, count_vector in enumerate(weight_arr):
    nonzero_feature_indices = count_vector.nonzero()[1]  # [1]: column indices of nonzero entries
    if nonzero_feature_indices.size == 0:
        print("EMPTY ROW!")

I use CountVectorizer's default parameters. I do not remove stop words, and I do not set any threshold that could produce empty documents:

{'binary': False, 'lowercase': True, 'stop_words': None, 'decode_error': u'strict', 'vocabulary': None, 'tokenizer': None, 'encoding': u'utf-8', 'dtype': <type 'numpy.int64'>, 'analyzer': u'word', 'ngram_range': (1, 1), 'max_df': 1.0, 'min_df': 1, 'max_features': None, 'input': u'content', 'strip_accents': None, 'token_pattern': u'(?u)\\b\\w\\w+\\b', 'preprocessor': None}


I found that several rows in weight_arr are all zeros. How is this possible?

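With your settings, a document that contains only one-letter words will produce an all-zero row, because your tokenizer is filtering out one-letter words.

You did not specify a tokenizer, but the default uses the following token pattern, which only matches tokens of two or more word characters: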
'token_pattern': u'(?u)\\b\\w\\w+\\b'
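
To see the effect of that pattern in isolation, here is a small sketch that uses only Python's `re` module. The sample strings and the `tokenize` helper are made up for illustration; the `.lower()` call imitates the `lowercase=True` default shown in the parameters above.

```python
import re

# Default token_pattern from the parameter dump: two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed variant: one or more word characters.
RELAXED_PATTERN = r"(?u)\b\w+\b"


def tokenize(text, pattern):
    """Hypothetical helper: lowercase the text, then extract matching tokens."""
    return re.findall(pattern, text.lower())


# Made-up sample documents.
samples = ["a b c", "I am", "x"]
for text in samples:
    print(repr(text),
          tokenize(text, DEFAULT_PATTERN),
          tokenize(text, RELAXED_PATTERN))
```

Any document for which the default pattern yields an empty token list ends up as an all-zero row in the count matrix.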

If you want to allow one-letter words, you can change it to:
'token_pattern': u'(?u)\\b\\w+\\b'

You just need to pass it to the constructor:
countVectorizer = CountVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
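
As a rough check, the following sketch compares the two patterns on a tiny corpus. The corpus and the `count_empty_rows` helper are assumptions made for illustration, not part of the original question; the empty rows are counted with the sparse matrix's `getnnz(axis=1)`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sample corpus: two documents contain only one-letter words.
corpus = ["a b", "hello world", "x y z"]


def count_empty_rows(vectorizer, docs):
    """Hypothetical helper: fit the vectorizer and count all-zero rows."""
    matrix = vectorizer.fit_transform(docs)
    # getnnz(axis=1) gives the number of stored nonzero entries per row.
    return int((matrix.getnnz(axis=1) == 0).sum())


default_empty = count_empty_rows(CountVectorizer(), corpus)
relaxed_empty = count_empty_rows(
    CountVectorizer(token_pattern=r"(?u)\b\w+\b"), corpus
)
print(default_empty, relaxed_empty)
```

With this corpus, the default vectorizer leaves the first and third documents empty, while the relaxed pattern keeps every row populated.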

That should work.
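Is it possible that your documents contain only one-letter words? The tokenizer filters those out.
@user3914041 Yes, I think that is possible. Do you know how to disable the removal of one-letter words?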