Python 2.7 vectorizer.fit_transform raises NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported
I am trying to create a term-document matrix with a custom analyzer, to extract features from the documents. Here is the code for that:
vectorizer = CountVectorizer(ngram_range=(1, 2))
analyzer = vectorizer.build_analyzer()

def customAnalyzer(text):
    grams = analyzer(text)
    tgrams = [gram for gram in grams if not re.match(r"^[0-9\s]+$", gram)]
    return tgrams
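For reference, the custom-analyzer approach above can be exercised end to end. This is a self-contained sketch, not the asker's exact setup: the sample documents are made up, and it targets Python 3 with a current scikit-learn rather than the Python 2.7 / 0.19.dev0 environment in the question.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

# Build a stock analyzer, then wrap it to drop purely numeric n-grams,
# mirroring the question's customAnalyzer.
base = CountVectorizer(ngram_range=(1, 2))
analyzer = base.build_analyzer()

def customAnalyzer(text):
    grams = analyzer(text)
    # Keep only n-grams that are not made up solely of digits/whitespace.
    return [gram for gram in grams if not re.match(r"^[0-9\s]+$", gram)]

vectorizer = CountVectorizer(analyzer=customAnalyzer)
docs = ["sample doc one 123", "sample doc 42 two"]  # stand-in corpus
X = vectorizer.fit_transform(docs)

# vocabulary_ is available across scikit-learn versions (unlike the
# renamed get_feature_names / get_feature_names_out pair).
vocab = set(vectorizer.vocabulary_)
print(sorted(vocab))
```

Note that the purely numeric grams "123" and "42" are filtered out, while mixed grams such as "one 123" survive, which is exactly what the regex is meant to do.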
This function is passed in as the custom analyzer, which CountVectorizer then uses to extract the features:
for i in xrange(0, num_rows):
    clean_query.append(review_to_words(inp["keyword"][i], units))

vectorizer = CountVectorizer(analyzer=customAnalyzer,
                             tokenizer=None,
                             ngram_range=(1, 2),
                             preprocessor=None,
                             stop_words=None,
                             max_features=n)

features = vectorizer.fit_transform(clean_query)
z = vectorizer.get_feature_names()
This call raises the following error:
(<type 'exceptions.NotImplementedError'>, 'python.py', 128,NotImplementedError('adding a nonzero scalar to a sparse matrix is not supported',))
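That message originates in SciPy, not scikit-learn: adding a nonzero scalar to a sparse matrix would turn every implicit zero into a nonzero value (densifying the matrix), so SciPy refuses. A two-line illustration of where the wording comes from — not necessarily the exact call path inside `fit_transform`:

```python
from scipy.sparse import csr_matrix

m = csr_matrix([[1, 0], [0, 2]])
try:
    m + 1  # would make every stored zero nonzero, so SciPy rejects it
except NotImplementedError as exc:
    print(type(exc).__name__, exc)
```

Somewhere along the way, the code is evidently performing `sparse_matrix + scalar` with a nonzero scalar.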
Here is a small test I wrote to reproduce the error, but it does not throw the same error for me. (This example was taken from:)
scikit-learn version: 0.19.dev0
In [1]: corpus = [
   ...:     'This is the first document.',
   ...:     'This is the second second document.',
   ...:     'And the third one.',
   ...:     'Is this the first document?',
   ...: ]
In [2]: from sklearn.feature_extraction.text import TfidfVectorizer
In [3]: vectorizer = TfidfVectorizer(min_df=1)
In [4]: vectorizer.fit_transform(corpus)
Out[4]:
<4x9 sparse matrix of type '<type 'numpy.float64'>'
with 19 stored elements in Compressed Sparse Row format>
In [5]: import numpy as np
In [6]: np.isscalar(corpus)
Out[6]: False
In [7]: type(corpus)
Out[7]: list
As the code above shows, corpus is not a scalar; it is of type list.
I think your solution lies in how you create the clean_query variable, so that it is what the vectorizer.fit_transform function expects. Please post your data so the error can be reproduced.
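One quick way to act on that advice is to sanity-check clean_query just before calling fit_transform: it must be a flat iterable of raw strings, not a scalar and not a list of token lists. A hypothetical check — the sample data here is made up, since in the question clean_query is built by review_to_words:

```python
import numpy as np

# Stand-in for the question's clean_query, built from review_to_words().
clean_query = ["first cleaned keyword", "second cleaned keyword"]

# fit_transform expects a flat iterable of strings.
assert not np.isscalar(clean_query)
assert all(isinstance(doc, str) for doc in clean_query)
print("clean_query looks like a valid corpus of", len(clean_query), "documents")
```

If any of these assertions fail on the real data, that is the place to look first.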