
Python sklearn CountVectorizer returns all zeros: string conversion issue?

Tags: python, python-2.7, pandas, scikit-learn, countvectorizer

I am trying to use sklearn's CountVectorizer with a given vocabulary. My vocabulary is:

['humanitarian crisis', 'vacations for the anti-cruise crowd', 'school textbook', "b'cruise vacations for the anti-cruise", 'budget deal', "b'public school", 'u.n. announces', 'wrong petrol', 'vacations for the anti-cruise', "b'cruise vacations for the anti-cruise crowd"]
The input to be vectorized comes from a DataFrame, which I read from a CSV file using pd.read_csv with encoding='utf8':

29371            b'9 quirky and brilliant paris boutiques'
20525    b'public school textbook filled with muslim bi...
2871     b'congress focuses on averting shutdown, but t...
29902    b'yarmouk siege: u.n. announces trip to syria ...
45596    b'fracking protesters arrested for gluing them...
6266         b'cruise vacations for the anti-cruise crowd'
After calling CountVectorizer(vocabulary=vocabulary).fit_transform(), I get an all-zero matrix:

(<6x10 sparse matrix of type '<type 'numpy.int64'>'
    with 0 stored elements in Compressed Sparse Row format>, <class 'scipy.sparse.csr.csr_matrix'>)
(,)

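Is this a problem caused by the string type, or by how I am calling CountVectorizer? I don't know how to convert the string type; I have tried many combinations of encode and decode calls in Python 2.7 and pandas. Any suggestions would be appreciated.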

Use ngram_range=(min_word_count, max_word_count) when calling CountVectorizer.
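
A minimal sketch of what this suggestion changes, assuming a shortened three-phrase vocabulary and two made-up headlines (both are illustrative, not the asker's real data). With the default ngram_range=(1, 1), CountVectorizer only generates single words, so a vocabulary of two-word phrases never matches; with ngram_range=(2, 2) it generates two-word sequences that can.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative data only: a shortened vocabulary and two invented headlines.
vocabulary = ["school textbook", "budget deal", "humanitarian crisis"]
docs = [
    "public school textbook filled with muslim bias",
    "congress reaches a budget deal",
]

# Default ngram_range=(1, 1): only single words are produced,
# so none of the two-word vocabulary entries can be counted.
unigram_counts = CountVectorizer(vocabulary=vocabulary).fit_transform(docs)

# ngram_range=(2, 2): two-word sequences are produced and compared
# against the vocabulary, so the phrases can be found.
bigram_counts = CountVectorizer(
    vocabulary=vocabulary, ngram_range=(2, 2)
).fit_transform(docs)

print(unigram_counts.nnz)       # number of non-zero entries
print(bigram_counts.toarray())  # one row per document, one column per phrase
```

The columns follow the order of the vocabulary list, so the first column counts "school textbook" and the second "budget deal".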

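If you dump the iris dataset to a CSV, then read it in with your code and fit_transform the species column, do you get the same error?

Show the complete code. What is vocabulary? How are you passing the data to fit_transform()?

The vocabulary consists of the individual words learned by CountVectorizer, or the words to look for when the input documents are split on spaces (" "). So I'm afraid your vocabulary, which contains phrases rather than single words, does not match any word in the given data, and the result therefore has 0 stored elements. Read up on how the vocabulary works.

Can you explain how to calculate min_word_count and max_word_count?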
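
One possible way to derive the two numbers asked about above is to count the whitespace-separated words in each vocabulary phrase and take the smallest and largest count. The sketch below does that on a three-entry subset of the question's vocabulary and two of its sample rows. The token_pattern=r"\S+" setting is an assumption added here, not something proposed in the thread: the default pattern keeps only runs of two or more word characters, so hyphens and the literal b' prefix would be dropped and entries such as "b'public school" could not match.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Subset of the vocabulary and sample rows from the question.
vocabulary = [
    "humanitarian crisis",
    "vacations for the anti-cruise",
    "b'public school",
]
docs = [
    "b'public school textbook filled with muslim bias'",
    "b'cruise vacations for the anti-cruise crowd'",
]

# Number of whitespace-separated words in each vocabulary phrase.
word_counts = [len(phrase.split()) for phrase in vocabulary]
min_word_count, max_word_count = min(word_counts), max(word_counts)

vectorizer = CountVectorizer(
    vocabulary=vocabulary,
    ngram_range=(min_word_count, max_word_count),
    # Assumption: tokenize on whitespace only so punctuation such as
    # "-" and "'" stays inside the tokens, as it does in the vocabulary.
    token_pattern=r"\S+",
)
counts = vectorizer.fit_transform(docs).toarray()

print(min_word_count, max_word_count)
print(counts)
```

With this tokenization the trailing quote stays attached to the last word of each row (for example crowd'), so a phrase that ends on that word would still not match; cleaning the b'...' wrapper out of the column before vectorizing is probably the more robust fix.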