Python sklearn CountVectorizer returns all zeros: a string conversion problem?

python python-2.7 pandas scikit-learn

I am trying to use sklearn's CountVectorizer with a given vocabulary. My vocabulary is:

['humanitarian crisis', 'vacations for the anti-cruise crowd', 'school textbook', "b'cruise vacations for the anti-cruise", 'budget deal', "b'public school", 'u.n. announces', 'wrong petrol', 'vacations for the anti-cruise', "b'cruise vacations for the anti-cruise crowd"]

The input to be vectorized comes from a DataFrame. I read it from a CSV file using pd.read_csv with encoding='utf8':

29371            b'9 quirky and brilliant paris boutiques'
20525    b'public school textbook filled with muslim bi...
2871     b'congress focuses on averting shutdown, but t...
29902    b'yarmouk siege: u.n. announces trip to syria ...
45596    b'fracking protesters arrested for gluing them...
6266         b'cruise vacations for the anti-cruise crowd'

After calling CountVectorizer(vocabulary=vocabulary).fit_transform(), I get an all-zero matrix:

(<6x10 sparse matrix of type '<type 'numpy.int64'>'
    with 0 stored elements in Compressed Sparse Row format>, <class 'scipy.sparse.csr.csr_matrix'>)

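Is this a problem caused by the string type, or by how I am calling CountVectorizer? I don't know how to convert the string type; I have tried many combinations of encode and decode calls in Python 2.7 and pandas. Any suggestions would be appreciated.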
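
One possible reading of the rows shown above, offered as an assumption rather than a confirmed diagnosis: the b'...' wrapper may be literal text stored in the CSV (for example, if the file was written from bytes objects), in which case encode and decode would not remove it. A minimal sketch of undoing that wrapper with the standard library follows; the helper name strip_bytes_repr is hypothetical.

```python
import ast


def strip_bytes_repr(text):
    """Turn the literal text "b'...'" back into a plain string.

    Text that does not look like a bytes literal is returned unchanged.
    """
    if text.startswith(("b'", 'b"')):
        try:
            value = ast.literal_eval(text)  # safely parses the literal, no eval()
        except (ValueError, SyntaxError):
            return text  # not a well-formed literal, leave it alone
        if isinstance(value, bytes):
            return value.decode("utf8")
    return text


# The two rows from the question that are shown in full.
rows = [
    "b'9 quirky and brilliant paris boutiques'",
    "b'cruise vacations for the anti-cruise crowd'",
]
cleaned = [strip_bytes_repr(r) for r in rows]
print(cleaned)
```

With pandas, the same helper could be applied to the text column with Series.apply before vectorizing.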

Use ngram_range=(min_word_count, max_word_count) when calling CountVectorizer.
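
To illustrate this suggestion, here is a minimal sketch comparing the default settings with ngram_range=(2, 2). The vocabulary is a subset of the one in the question; the documents are made-up stand-ins, not the asker's data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two-word phrases taken from the question's vocabulary.
vocabulary = ["humanitarian crisis", "school textbook", "budget deal"]

# Hypothetical documents, used only for illustration.
docs = [
    "public school textbook filled with errors",
    "congress reaches budget deal",
    "quirky paris boutiques",
]

# Default ngram_range=(1, 1): only single words are produced,
# so no two-word vocabulary entry can ever match.
default_counts = CountVectorizer(vocabulary=vocabulary).fit_transform(docs)
print(default_counts.nnz)

# ngram_range=(2, 2): two-word sequences are produced and can match.
bigram_counts = CountVectorizer(
    vocabulary=vocabulary, ngram_range=(2, 2)
).fit_transform(docs)
print(bigram_counts.toarray())
```

The first matrix has no stored elements, as in the question; the second has one match in each of the first two documents.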

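If you dump the iris dataset to a CSV, then read it in with your code and fit_transform the species column, do you get the same error? Please show the complete code. What is vocabulary? How are you passing the data to fit_transform()?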

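The vocabulary consists of the individual words learned by CountVectorizer, or the words to look for once the input documents have been split into tokens (roughly, on spaces). With the default ngram_range=(1, 1), every token is a single word, so I'm afraid your vocabulary (which contains phrases rather than words) cannot match anything in the given data, and the result has 0 stored elements. Read up on how the vocabulary works.

Can you explain how min_word_count and max_word_count are computed?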
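
One way those two values might be derived, sketched under assumptions: count the whitespace-separated words in each vocabulary phrase and take the smallest and largest counts. The example uses the entries from the question's vocabulary that have no b' prefix, and the two rows shown in full with the b'...' wrapper removed. It also passes token_pattern=r"\S+" (a choice made here, not something from the thread), because the default pattern splits "anti-cruise" into two tokens and drops single-character tokens such as those in "u.n.".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Entries from the question's vocabulary, without the b'-prefixed ones.
vocabulary = [
    "humanitarian crisis",
    "vacations for the anti-cruise crowd",
    "school textbook",
    "budget deal",
    "u.n. announces",
    "wrong petrol",
    "vacations for the anti-cruise",
]

# Number of whitespace-separated words in each phrase.
lengths = [len(phrase.split()) for phrase in vocabulary]
min_words, max_words = min(lengths), max(lengths)
print(min_words, max_words)

# Rows from the question, assumed already stripped of the b'...' wrapper.
docs = [
    "9 quirky and brilliant paris boutiques",
    "cruise vacations for the anti-cruise crowd",
]

vectorizer = CountVectorizer(
    vocabulary=vocabulary,
    ngram_range=(min_words, max_words),
    token_pattern=r"\S+",  # tokens are runs of non-whitespace characters
)
counts = vectorizer.fit_transform(docs)
print(counts.toarray())
```

Here the second document matches both "vacations for the anti-cruise crowd" and "vacations for the anti-cruise", while the first document matches nothing.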