Python CountVectorizer does not print the vocabulary


I have installed Python 2.7, numpy 1.9.0, scipy 0.15.1 and scikit-learn 0.15.2. Now, when I run the following in Python:

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
"We can see the shining sun, the bright sun.")

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

print vectorizer


    CountVectorizer(analyzer=u'word', binary=False, charset=None,
    charset_error=None, decode_error=u'strict',
    dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
    lowercase=True, max_df=1.0, max_features=None, min_df=1,
    ngram_range=(1, 1), preprocessor=None, stop_words=None,
    strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
    tokenizer=None, vocabulary=None)

vectorizer.fit_transform(train_set)
print vectorizer.vocabulary

    None.
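The code above is from a blog post.

Can you help me understand why I get this result? Because the vocabulary is not indexed properly, I cannot move ahead with understanding the concept of TF-IDF. I am new to Python, so any help would be appreciated.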


Arc.

You are missing an underscore. Try it this way:

from sklearn.feature_extraction.text import CountVectorizer
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.", 
    "We can see the shining sun, the bright sun.")

vectorizer = CountVectorizer(stop_words='english')
document_term_matrix = vectorizer.fit_transform(train_set)
print vectorizer.vocabulary_
# {u'blue': 0, u'sun': 3, u'bright': 1, u'sky': 2}

If you use the ipython shell, you can use tab completion to find the methods and attributes of an object more easily.
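To make the difference between the two attributes concrete, here is a minimal sketch. It assumes Python 3 and a reasonably recent scikit-learn rather than the Python 2.7 / scikit-learn 0.15.2 setup from the question, and the variable names are only illustrative. It checks both attributes before and after fitting: vocabulary simply echoes the constructor argument, while vocabulary_ is created by fitting.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")

vectorizer = CountVectorizer(stop_words="english")

# `vocabulary` is the constructor parameter; it stays None unless you pass one in.
param_before_fit = vectorizer.vocabulary
# `vocabulary_` (trailing underscore) does not exist until the vectorizer is fitted.
has_learned_vocab_before_fit = hasattr(vectorizer, "vocabulary_")

vectorizer.fit_transform(train_set)

param_after_fit = vectorizer.vocabulary       # still the constructor argument
learned_vocab = dict(vectorizer.vocabulary_)  # term -> column index

print(param_before_fit, has_learned_vocab_before_fit)
print(param_after_fit)
print(sorted(learned_vocab.items()))
```

The trailing underscore is the scikit-learn naming convention for attributes that are learned from data during fit, which is why the name without the underscore printed None in the question.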

Try using the vectorizer.get_feature_names() method. It returns the column names in the order in which they appear in document_term_matrix.

from sklearn.feature_extraction.text import CountVectorizer
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.", 
    "We can see the shining sun, the bright sun.")

vectorizer = CountVectorizer(stop_words='english')
document_term_matrix = vectorizer.fit_transform(train_set)
vectorizer.get_feature_names()
#> ['blue', 'bright', 'sky', 'sun']
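As a rough illustration of how the feature names line up with the matrix columns, the sketch below pairs each row of the document-term matrix with the names. It assumes Python 3; in recent scikit-learn releases the method is called get_feature_names_out() and the older get_feature_names() is no longer available, so the code uses whichever one the installed version provides. The variable names feature_names and rows are mine, not part of the original answer.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")

vectorizer = CountVectorizer(stop_words="english")
document_term_matrix = vectorizer.fit_transform(train_set)

# Newer releases provide get_feature_names_out(); older ones only have
# get_feature_names(). Use whichever is available.
if hasattr(vectorizer, "get_feature_names_out"):
    feature_names = list(vectorizer.get_feature_names_out())
else:
    feature_names = list(vectorizer.get_feature_names())

# Convert the sparse matrix to a dense list of rows for display.
rows = document_term_matrix.toarray().tolist()

print(feature_names)
for doc, row in zip(train_set, rows):
    # Each count in the row belongs to the feature name at the same position.
    print(doc, dict(zip(feature_names, row)))
```

Column i of the matrix holds the counts for feature_names[i], so the first document gets a 1 under blue and sky, and the second under bright and sun.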

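Oh yes, thanks. I will try using the ipython shell so that tab completion keeps me from missing things like this. I had heard of the ipython shell but never knew about this feature. Thanks for the info. In my question I also asked why stop_words=None in my CountVectorizer, which I thought should not be the case.

The default value of stop_words is None. If you want to use the built-in English stop words, you can create the vectorizer like this: vectorizer = CountVectorizer(stop_words='english').

Thanks. I thought the stop words were built into the function. One more thing to clarify: does the vocabulary index the terms in alphabetical order? That is, in the example above 'blue' gets 0, and is 1 then given to 'bright' because it is the next term alphabetically?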
vectorizer.vocabulary_ with the underscore is what you want. vectorizer.vocabulary is not what you want; it is the vocabulary you passed in, if any (usually none).
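On the question raised in the comments about alphabetical order: when CountVectorizer learns the vocabulary itself, the indices follow the sorted order of the terms, which for lowercase English words is alphabetical. The pure-Python sketch below mimics that behaviour under simplifying assumptions. build_vocabulary is a hypothetical helper, not scikit-learn's implementation, and the two stop words are passed in by hand instead of using the built-in English list.

```python
import re

# Same pattern as the default token_pattern shown in the question's output:
# words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_vocabulary(documents, stop_words=frozenset()):
    """Hypothetical helper: map each distinct token to its sorted position."""
    tokens = set()
    for doc in documents:
        # Lowercase first, as CountVectorizer does with lowercase=True.
        for token in TOKEN_PATTERN.findall(doc.lower()):
            if token not in stop_words:
                tokens.add(token)
    # Sorting the distinct tokens and enumerating them yields the indices.
    return {token: index for index, token in enumerate(sorted(tokens))}


train_set = ("The sky is blue.", "The sun is bright.")
vocab = build_vocabulary(train_set, stop_words={"the", "is"})
print(vocab)
```

With the stop words removed this gives blue 0, bright 1, sky 2, sun 3, matching the vocabulary_ output shown in the answer. Python sorts strings by code point, so terms containing uppercase letters, digits or accented characters may not land in strict dictionary order.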