Python scipy.sparse.csr_matrix row filtering: how to implement it properly?


I'm working with some scipy.sparse.csr_matrix objects. To be honest, the one I have at hand comes from scikit-learn's TfidfVectorizer:

vectorizer = TfidfVectorizer(min_df=0.0005)
textsMet2 = vectorizer.fit_transform(textsMet)
This is the resulting matrix:

textsMet2
<999x1632 sparse matrix of type '<class 'numpy.float64'>'
    with 5042 stored elements in Compressed Sparse Row format>
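When I try to index it to keep only the rows with non-zero elements, I get an error:

  File "D:\Apps\Python\lib\site-packages\scipy\sparse\sputils.py", line 327, in _boolean_index_to_array
    raise IndexError('invalid index shape')
IndexError: invalid index shape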

If I remove the last part of the index, I get a strange result:

textsMet2[(textsMet2.sum(axis=1)>0)]
<1x492 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Row format>
Why does it show only a one-row matrix?


Once again, I want to get all rows of this matrix that have any non-zero elements. Does anyone know how to do that?

You need to ravel your mask. Here is a bit of code from something I'm working on at the moment:

tr_matrix = pipeline.fit_transform(train_text, y_train, **fit_params)

# remove documents with too few features
to_keep_train = tr_matrix.sum(axis=1) >= config['min_train_features']
to_keep_train = np.ravel(np.array(to_keep_train))
logging.info('%d/%d train documents have enough features',
             sum(to_keep_train), len(y_train))
tr_matrix = tr_matrix[to_keep_train, :]
It's a little inelegant, but it gets the job done.
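For anyone who wants to try this without a fitted pipeline, here is a minimal self-contained sketch of the same idea. The toy matrix and the names dense, X, row_mask and filtered are made up for illustration, and the threshold is `> 0` as in the question rather than config['min_train_features']. The np.diff(X.indptr) cross-check is not part of the answer above; it is just another way to count stored entries per row in CSR format.

```python
import numpy as np
from scipy import sparse

# Small stand-in for the TF-IDF matrix (hypothetical data):
# rows 1 and 3 contain no non-zero entries.
dense = np.array([
    [0.5, 0.0, 0.2],
    [0.0, 0.0, 0.0],
    [0.0, 0.7, 0.0],
    [0.0, 0.0, 0.0],
])
X = sparse.csr_matrix(dense)

# sum(axis=1) yields a dense 2-D result with one column, so the comparison
# is 2-D as well. Flatten it to a 1-D boolean mask before indexing.
row_mask = np.ravel(np.asarray(X.sum(axis=1) > 0))

# A 1-D boolean mask selects whole rows; the slice keeps all columns.
filtered = X[row_mask, :]

# Cross-check (an alternative, not the method from the answer):
# in CSR, consecutive differences of indptr give the stored entries per row.
stored_per_row = np.diff(X.indptr)
alt_mask = stored_per_row > 0
```

Keep in mind that a row-sum test only means "has a non-zero element" when the values cannot cancel each other out. That holds for TF-IDF weights because they are non-negative. The indptr count, on the other hand, would also count explicitly stored zeros if the matrix had any.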

Cool, thanks. I didn't realize that what it returns needs to be reshaped.
sum(1) returns a matrix, not a sparse matrix, and np.ravel(...) returns an array. Sometimes you just have to experiment to see what turns a sparse matrix into a matrix, a matrix into an array, and so on.
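That kind of experiment can be as simple as printing the type and shape at each step. The snippet below is only a sketch on a made-up 3x2 matrix; the variable names are hypothetical and the exact type names printed may depend on your SciPy and NumPy versions.

```python
import numpy as np
from scipy import sparse

# Hypothetical 3x2 example matrix; the middle row is empty.
X = sparse.csr_matrix(np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 2.0]]))

row_sums = X.sum(axis=1)                 # dense 2-D result with a single column
mask_2d = row_sums > 0                   # comparison keeps the 2-D shape
mask_1d = np.ravel(np.asarray(mask_2d))  # plain 1-D boolean array

# Inspect what each step actually produced.
for name, obj in [("row_sums", row_sums), ("mask_2d", mask_2d), ("mask_1d", mask_1d)]:
    print(name, type(obj).__name__, obj.shape, "sparse:", sparse.issparse(obj))
```

The 2-D shape of the intermediate mask is what trips up the indexing in the question; once it is flattened to one dimension it works as a row selector.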