
Python 2.7: vectorizer.fit_transform raises NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported


I am trying to create a term-document matrix with a custom analyzer to extract features from documents. Here is the code for that:

import re
from sklearn.feature_extraction.text import CountVectorizer

# Base analyzer producing unigrams and bigrams.
vectorizer = CountVectorizer(ngram_range=(1, 2))
analyzer = vectorizer.build_analyzer()

def customAnalyzer(text):
    # Drop n-grams that consist solely of digits and whitespace.
    grams = analyzer(text)
    tgrams = [gram for gram in grams if not re.match(r"^[0-9\s]+$", gram)]
    return tgrams
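
As a quick sanity check (a hedged example; the input string is invented), the analyzer keeps mixed n-grams but drops the purely numeric ones:

print customAnalyzer("call 42 support")
# purely numeric grams such as '42' are dropped;
# mixed grams such as 'call 42' and '42 support' survive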
This function is then passed to CountVectorizer as the analyzer it uses to extract features:

# Build the corpus, then vectorize it with the custom analyzer.
for i in xrange(0, num_rows):
    clean_query.append(review_to_words(inp["keyword"][i], units))

vectorizer = CountVectorizer(analyzer=customAnalyzer,
                             tokenizer=None,
                             ngram_range=(1, 2),
                             preprocessor=None,
                             stop_words=None,
                             max_features=n)
features = vectorizer.fit_transform(clean_query)
z = vectorizer.get_feature_names()
This call raises the following error:

(<type 'exceptions.NotImplementedError'>, 'python.py', 128,NotImplementedError('adding a nonzero scalar to a sparse matrix is not supported',))
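
The message itself comes from SciPy: adding a nonzero scalar to a sparse matrix is undefined, because it would turn every implicit zero into a nonzero entry. A minimal sketch of the underlying operation, only to show where the wording originates (this is not the question's exact code path, which is hidden inside fit_transform):

import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.eye(3))  # sparse 3x3 identity matrix
m + 0  # fine: adding zero returns a copy
m + 1  # raises NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported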

This is a small test I put together to reproduce the error, but it did not throw the same error for me. (This example was taken from:)

scikit-learn version: 0.19.dev0

In [1]: corpus = [
   ...:     'This is the first document.',
   ...:     'This is the second second document.',
   ...:     'And the third one.',
   ...:     'Is this the first document?',
   ...: ]

In [2]: from sklearn.feature_extraction.text import TfidfVectorizer

In [3]: vectorizer = TfidfVectorizer(min_df=1)

In [4]: vectorizer.fit_transform(corpus)
Out[4]: 
<4x9 sparse matrix of type '<type 'numpy.float64'>'
    with 19 stored elements in Compressed Sparse Row format>

In [5]: import numpy as np

In [6]: np.isscalar(corpus)
Out[6]: False

In [7]: type(corpus)
Out[7]: list
As the code above shows, corpus is not a scalar but of type list.
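
A hedged diagnostic along the same lines (clean_query is the question's own variable): before calling fit_transform, check that clean_query has the same shape as corpus above, i.e. a flat list of plain strings:

print type(clean_query)                  # expect: <type 'list'>
print set(type(d) for d in clean_query)  # expect: only str / unicode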


I think the solution lies in how you create the clean_query variable, so that it matches what the vectorizer.fit_transform function expects.
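
As a hedged sketch (review_to_words is the question's own helper, whose body is not shown, so its return type is an assumption here): make sure every element appended to clean_query is a single string, not a list of tokens or some other object:

clean_query = []
for i in xrange(0, num_rows):
    doc = review_to_words(inp["keyword"][i], units)
    # fit_transform expects an iterable of raw documents (strings);
    # join here in case review_to_words returns a token list.
    if isinstance(doc, list):
        doc = " ".join(doc)
    clean_query.append(doc)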

Post your data so the error can be reproduced.