Python 2.7 vectorizer.fit_transform raises NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported
I am trying to create a term-document matrix with a custom analyzer, to extract features from the documents. Here is the code for that:
vectorizer = CountVectorizer(ngram_range=(1, 2))
analyzer = vectorizer.build_analyzer()

def customAnalyzer(text):
    grams = analyzer(text)
    tgrams = [gram for gram in grams if not re.match(r"^[0-9\s]+$", gram)]
    return tgrams
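For reference, the custom-analyzer approach above can be exercised end to end. This is a self-contained sketch, not the asker's exact setup: the sample documents are made up, and it targets Python 3 with a current scikit-learn rather than the Python 2.7 / 0.19.dev0 environment in the question.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

# Build a stock analyzer, then wrap it to drop purely numeric n-grams,
# mirroring the question's customAnalyzer.
base = CountVectorizer(ngram_range=(1, 2))
analyzer = base.build_analyzer()

def customAnalyzer(text):
    grams = analyzer(text)
    # Keep only n-grams that are not made up solely of digits/whitespace.
    return [gram for gram in grams if not re.match(r"^[0-9\s]+$", gram)]

vectorizer = CountVectorizer(analyzer=customAnalyzer)
docs = ["sample doc one 123", "sample doc 42 two"]  # stand-in corpus
X = vectorizer.fit_transform(docs)

# vocabulary_ is available across scikit-learn versions (unlike the
# renamed get_feature_names / get_feature_names_out pair).
vocab = set(vectorizer.vocabulary_)
print(sorted(vocab))
```

Note that the purely numeric grams "123" and "42" are filtered out, while mixed grams such as "one 123" survive, which is exactly what the regex is meant to do.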
This function is passed in as the custom analyzer, which CountVectorizer then uses to extract the features:
for i in xrange(0, num_rows):
    clean_query.append(review_to_words(inp["keyword"][i], units))

vectorizer = CountVectorizer(analyzer=customAnalyzer,
                             tokenizer=None,
                             ngram_range=(1, 2),
                             preprocessor=None,
                             stop_words=None,
                             max_features=n)

features = vectorizer.fit_transform(clean_query)
z = vectorizer.get_feature_names()
This call raises the following error:
(<type 'exceptions.NotImplementedError'>, 'python.py', 128,NotImplementedError('adding a nonzero scalar to a sparse matrix is not supported',))
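That message originates in SciPy, not scikit-learn: adding a nonzero scalar to a sparse matrix would turn every implicit zero into a nonzero value (densifying the matrix), so SciPy refuses. A two-line illustration of where the wording comes from — not necessarily the exact call path inside `fit_transform`:

```python
from scipy.sparse import csr_matrix

m = csr_matrix([[1, 0], [0, 2]])
try:
    m + 1  # would make every stored zero nonzero, so SciPy rejects it
except NotImplementedError as exc:
    print(type(exc).__name__, exc)
```

Somewhere along the way, the code is evidently performing `sparse_matrix + scalar` with a nonzero scalar.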
Here is a small test I wrote to reproduce the error, but it does not throw the same error for me. (This example was taken from:)
scikit-learn version: 0.19.dev0
In [1]: corpus = [
   ...:     'This is the first document.',
   ...:     'This is the second second document.',
   ...:     'And the third one.',
   ...:     'Is this the first document?',
   ...: ]
In [2]: from sklearn.feature_extraction.text import TfidfVectorizer
In [3]: vectorizer = TfidfVectorizer(min_df=1)
In [4]: vectorizer.fit_transform(corpus)
Out[4]:
<4x9 sparse matrix of type '<type 'numpy.float64'>'
with 19 stored elements in Compressed Sparse Row format>
In [5]: import numpy as np
In [6]: np.isscalar(corpus)
Out[6]: False
In [7]: type(corpus)
Out[7]: list
As the code above shows, corpus is not a scalar; it is of type list.
I think your solution lies in how you create the clean_query variable, so that it is what the vectorizer.fit_transform function expects. Please post your data so the error can be reproduced.
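One quick way to act on that advice is to sanity-check clean_query just before calling fit_transform: it must be a flat iterable of raw strings, not a scalar and not a list of token lists. A hypothetical check — the sample data here is made up, since in the question clean_query is built by review_to_words:

```python
import numpy as np

# Stand-in for the question's clean_query, built from review_to_words().
clean_query = ["first cleaned keyword", "second cleaned keyword"]

# fit_transform expects a flat iterable of strings.
assert not np.isscalar(clean_query)
assert all(isinstance(doc, str) for doc in clean_query)
print("clean_query looks like a valid corpus of", len(clean_query), "documents")
```

If any of these assertions fail on the real data, that is the place to look first.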