Python CountVectorizer does not print the vocabulary

python numpy scikit-learn

I have installed Python 2.7, numpy 1.9.0, scipy 0.15.1 and scikit-learn 0.15.2. Now, when I run the following in Python:

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
"We can see the shining sun, the bright sun.")

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

print vectorizer


    CountVectorizer(analyzer=u'word', binary=False, charset=None,
    charset_error=None, decode_error=u'strict',
    dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
    lowercase=True, max_df=1.0, max_features=None, min_df=1,
    ngram_range=(1, 1), preprocessor=None, stop_words=None,
    strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
    tokenizer=None, vocabulary=None)

    vectorizer.fit_transform(train_set)
    print vectorizer.vocabulary

    None

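The code above comes from a blog post.

Can you help me understand why I get this result? Because the vocabulary is not being indexed properly, I cannot move on to understanding the concept of TF-IDF. I am new to Python, so any help would be appreciated.

Arc.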

You are missing an underscore. Try it this way:

from sklearn.feature_extraction.text import CountVectorizer
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.", 
    "We can see the shining sun, the bright sun.")

vectorizer = CountVectorizer(stop_words='english')
document_term_matrix = vectorizer.fit_transform(train_set)
print vectorizer.vocabulary_
# {u'blue': 0, u'sun': 3, u'bright': 1, u'sky': 2}

If you use the ipython shell, you can use tab completion, which makes it much easier to find an object's methods and attributes.

Try the vectorizer.get_feature_names() method. It returns the column names in the order in which they appear in document_term_matrix.

from sklearn.feature_extraction.text import CountVectorizer
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.", 
    "We can see the shining sun, the bright sun.")

vectorizer = CountVectorizer(stop_words='english')
document_term_matrix = vectorizer.fit_transform(train_set)
vectorizer.get_feature_names()
#> ['blue', 'bright', 'sky', 'sun']
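
To make the "column names in column order" point concrete, here is a minimal pure-Python sketch of the idea, not scikit-learn's actual implementation. The helper tokenize and the two-word STOP_WORDS set are assumptions made up for this example (the library's built-in 'english' list is far longer); the token pattern is the default shown in the question's output. As far as I know, newer scikit-learn releases expose the same information as get_feature_names_out().

```python
import re

# Default token pattern from the question's CountVectorizer repr:
# tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Assumption: a tiny stop-word set, just enough for this example.
STOP_WORDS = {"the", "is"}


def tokenize(doc):
    """Lowercase the text, split it into tokens, drop stop words."""
    return [t for t in TOKEN_PATTERN.findall(doc.lower())
            if t not in STOP_WORDS]


train_set = ("The sky is blue.", "The sun is bright.")

# Sorted unique terms: column j of the matrix belongs to feature_names[j].
feature_names = sorted({t for doc in train_set for t in tokenize(doc)})

# One row per document, one count per feature, in the same column order.
matrix = [[tokenize(doc).count(name) for name in feature_names]
          for doc in train_set]

print(feature_names)  # ['blue', 'bright', 'sky', 'sun']
for row in matrix:
    print(row)        # [1, 0, 1, 0] then [0, 1, 0, 1]
```

Reading the first row against the names: "The sky is blue." has one 'blue' (column 0) and one 'sky' (column 2).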

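Yes, thanks! I will try the ipython shell so that tab completion keeps me from missing things like this. I had heard of the ipython shell but never knew about this feature. Thanks for the information. Above I also asked why stop_words=None shows up in my CountVectorizer, which I did not expect.

The default value of stop_words is None. If you want to use the built-in English stop words, create the vectorizer like this: vectorizer = CountVectorizer(stop_words='english').

Thanks. I thought the stop words were built into the function. One more thing to clarify: does the vocabulary assign indices to the terms in alphabetical order? I.e. in the example above, 'blue' gets 0, and 1 goes to 'bright', the next term alphabetically?

vectorizer.vocabulary_ with the underscore is what you want. vectorizer.vocabulary is not what you want: it is the vocabulary you passed in, if any (usually none).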
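
The difference between the two attributes can be sketched with a small stand-in class. TinyCountVectorizer is hypothetical and written only for illustration; it is not scikit-learn's code, and it assumes any passed-in vocabulary is a dict. It follows the same convention, though: constructor arguments are stored unchanged, while anything learned during fitting gets a trailing underscore. It also sorts the terms before numbering them, which is, to my knowledge, why the indices come out in alphabetical order.

```python
import re

# Default token pattern from the question's CountVectorizer repr.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


class TinyCountVectorizer(object):
    """Hypothetical stand-in for illustration only."""

    def __init__(self, vocabulary=None):
        # Constructor parameter: kept exactly as passed in, so it
        # stays None unless the caller supplies a vocabulary.
        self.vocabulary = vocabulary

    def fit(self, documents):
        if self.vocabulary is not None:
            # Assumption: a supplied vocabulary is a term -> index dict.
            self.vocabulary_ = dict(self.vocabulary)
            return self
        terms = set()
        for doc in documents:
            terms.update(TOKEN_PATTERN.findall(doc.lower()))
        # Learned attribute (trailing underscore). Terms are sorted
        # first, so indices follow alphabetical order.
        self.vocabulary_ = dict(
            (term, i) for i, term in enumerate(sorted(terms)))
        return self


train_set = ("The sky is blue.", "The sun is bright.")
vectorizer = TinyCountVectorizer().fit(train_set)

print(vectorizer.vocabulary)  # None: nothing was passed in
# Terms listed by index; no stop words are removed in this sketch.
print(sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1]))
```

With no stop-word filtering, 'is' and 'the' stay in the learned vocabulary, so 'blue' gets 0, 'bright' gets 1, and the rest follow alphabetically.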
