Python sklearn CountVectorizer with the vocabulary option and bigrams returns an array of zeros


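I want to extract bigrams from one array, keep every bigram whose frequency is greater than 100, and then score a second array using that reduced vocabulary.

The vocabulary option looks like it should do what I need, but it does not seem to work. Even feeding the output of one vectorizer directly into the other only produces a (correctly shaped) array of zeros.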

from sklearn.feature_extraction.text import CountVectorizer

docs = ['run fast into a bush','run fast into a tree','run slow','run fast']

# Collect bigrams
vectorizer = CountVectorizer(ngram_range = (2,2))
vectorizer.fit(docs)
vocab = vectorizer.vocabulary_

# Score the exact same data
vectorizer = CountVectorizer(vocabulary=vocab)
output = vectorizer.transform(docs)

# Demonstrate that the array is all zeros
print("Length of vocab", len(vocab))
print(output.toarray())



Length of vocab 5
[[0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]]

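Just figured it out. You need to specify ngram_range on the second instance as well. The vocabulary does not tell the vectorizer whether to generate unigrams or bigrams, so with the default ngram_range of (1, 1) it only produces unigrams, none of which match the bigram vocabulary.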
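
As a minimal sketch of that fix, the question's example can be rerun with ngram_range passed to both vectorizers. The variable names bigram_vectorizer and scorer are chosen here for illustration and are not from the original post.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['run fast into a bush', 'run fast into a tree', 'run slow', 'run fast']

# Collect bigrams from the documents
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_vectorizer.fit(docs)
vocab = bigram_vectorizer.vocabulary_

# Score the same data: ngram_range must match the n-grams stored in vocab,
# otherwise the vectorizer only generates unigrams and nothing matches
scorer = CountVectorizer(vocabulary=vocab, ngram_range=(2, 2))
output = scorer.transform(docs).toarray()

print("Length of vocab", len(vocab))
print(output)
```

With the default token pattern the single-character token "a" is dropped, so the five bigrams are "fast into", "into bush", "into tree", "run fast" and "run slow", and the counts are no longer all zero.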


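Create a corpus of tokenized words. If there is an odd number of words, set the n-gram size to 3; otherwise set it to 2 for an even number. Create a bag-of-words matrix. Then build a DataFrame from the bag-of-words matrix, using the vectorizer's words as the feature columns.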

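from sklearn.feature_extraction.text import CountVectorizer
import nltk
import pandas as pd

docs = ['run fast into a bush','run fast into a tree','run slow','run fast']
str_buffer = " ".join(docs)
# Split the joined text into individual words (needs the NLTK "punkt" tokenizer data)
corpus = nltk.word_tokenize(str_buffer)

# ngram_range must be a (min_n, max_n) tuple. Each corpus entry is a single
# word, so only unigram features can appear in the result.
vectorizer_ng2 = CountVectorizer(ngram_range=(1, 2), stop_words='english')
bow_matrix = vectorizer_ng2.fit_transform(corpus)

print(bow_matrix.toarray())
bow_df = pd.DataFrame(bow_matrix.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
bow_df.columns = vectorizer_ng2.get_feature_names_out()
print(bow_df)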
Output:

    bush  fast  run  slow  tree
0      0     0    1     0     0
1      0     1    0     0     0
2      0     0    0     0     0
3      0     0    0     0     0
4      1     0    0     0     0
5      0     0    1     0     0
6      0     1    0     0     0
7      0     0    0     0     0
8      0     0    0     0     0
9      0     0    0     0     1
10     0     0    1     0     0
11     0     0    0     1     0
12     0     0    1     0     0
13     0     1    0     0     0
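
For the original goal (keep only frequent bigrams, then score a second array), a rough sketch along the following lines may help. The names train_docs, new_docs and threshold are hypothetical, and the threshold is set to 1 only so that the toy data yields a non-empty vocabulary; the question uses 100.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data: the first array builds the vocabulary, the second is scored
train_docs = ['run fast into a bush', 'run fast into a tree', 'run slow', 'run fast']
new_docs = ['run fast into a wall', 'walk slow']
threshold = 1  # assumption for this toy example; the question uses 100

# Count every bigram in the first array
counter = CountVectorizer(ngram_range=(2, 2))
counts = counter.fit_transform(train_docs)

# Total occurrences of each bigram across all documents
totals = np.asarray(counts.sum(axis=0)).ravel()

# Feature names ordered by column index
names = sorted(counter.vocabulary_, key=counter.vocabulary_.get)

# Keep only bigrams whose total frequency exceeds the threshold
reduced_vocab = [name for name, total in zip(names, totals) if total > threshold]

# Score the second array; ngram_range must again match the vocabulary
scorer = CountVectorizer(vocabulary=reduced_vocab, ngram_range=(2, 2))
scores = scorer.transform(new_docs).toarray()

print(reduced_vocab)
print(scores)
```

Here only "fast into" and "run fast" occur more than once in train_docs, so the scored matrix has two columns. CountVectorizer also has a min_df parameter, but that filters on the number of documents containing a term rather than on total occurrences, which is why the totals are summed explicitly in this sketch.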