Python: How do I select only certain rows of a scipy.sparse csr_matrix? (Python, Pandas, Scipy, Sparse Matrix)

Here is an example of filtering rows from a Pandas DataFrame, first dense and then sparse:

import pandas as pd
from scipy.sparse import csr_matrix

df = pd.DataFrame({'thing': [1, 1, 2, 2, 2],
                   'score': [0.12, 0.13, 0.14, 0.15, 0.17]})

row_index = df['thing'] == 1
print(type(row_index), row_index)
print(df[row_index])
sdf = csr_matrix(df)
print(sdf[row_index])

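The second print returns only the first two rows, as expected. The third print raises an error (see the full output below).

How can I fix this code so that it correctly filters the rows of the csr_matrix by row_index without converting it to a dense matrix? In my real use case the matrix is the output of a TF/IDF vectorizer, so it has thousands of columns and I do not want to densify it.

I found some related questions, but I cannot tell whether they contain the answer.

I am using pandas 0.25.3 and scipy 1.3.2.

Full output of the code above: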

<class 'pandas.core.series.Series'> 0     True
1     True
2    False
3    False
4    False
Name: thing, dtype: bool
   thing  score
0      1   0.12
1      1   0.13
Traceback (most recent call last):
  File "./foo.py", line 13, in <module>
    print(sdf[row_index])
  File "root/.venv/lib/python3.7/site-packages/scipy/sparse/_index.py", line 59, in __getitem__
    return self._get_arrayXslice(row, col)
  File "root/.venv/lib/python3.7/site-packages/scipy/sparse/csr.py", line 325, in _get_arrayXslice
    return self._major_index_fancy(row)._get_submatrix(minor=col)
  File "root/.venv/lib/python3.7/site-packages/scipy/sparse/compressed.py", line 690, in _major_index_fancy
    np.cumsum(row_nnz[idx], out=res_indptr[1:])
  File "<__array_function__ internals>", line 6, in cumsum
  File "root/.venv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2423, in cumsum
    return _wrapfunc(a, 'cumsum', axis=axis, dtype=dtype, out=out)
  File "root/.venv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 61, in _wrapfunc
    return bound(*args, **kwds)
ValueError: provided out is the wrong size for the reduction

Edit: this depends on the scipy version. I have filed an issue with scipy.

The equivalent array does work as a boolean index:

In [180]: row_index.values                                                      
Out[180]: array([ True,  True, False, False, False])
In [181]: sdf[_]                                                                
Out[181]: 
<2x2 sparse matrix of type '<class 'numpy.float64'>'
    with 4 stored elements in Compressed Sparse Row format>
In [182]: _.A                                                                   
Out[182]: 
array([[1.  , 0.12],
       [1.  , 0.13]])

But a sparse matrix is not a subclass of ndarray. It is similar in many ways, but it always uses its own code.
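As a minimal sketch of the point above, the question's example can be rerun with the boolean Series converted to a plain numpy array before indexing. The names `mask` and `filtered` are illustrative and not from the original post, and `csr_matrix(df.values)` is used here instead of `csr_matrix(df)` only to keep pandas out of the sparse constructor.

```python
import pandas as pd
from scipy.sparse import csr_matrix, issparse

df = pd.DataFrame({'thing': [1, 1, 2, 2, 2],
                   'score': [0.12, 0.13, 0.14, 0.15, 0.17]})

# Build the sparse matrix from the underlying numpy array.
sdf = csr_matrix(df.values)

# Convert the pandas boolean Series to a plain numpy bool array,
# so scipy.sparse sees an index type it handles itself.
mask = (df['thing'] == 1).values

# Boolean row selection; the result is still a sparse matrix.
filtered = sdf[mask]

print(issparse(filtered), filtered.shape)
print(filtered.toarray())  # densified here only to display the small result
```

On pandas 0.24 or later, `.to_numpy()` should be interchangeable with `.values` for this purpose.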

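Your code runs fine on my system (pandas 0.25, scipy 1.1.0).

@QuangHoang Wow. So something changed between scipy versions?

Yes: scipy 1.1.0, 1.2.0 and 1.2.1 work; 1.3.0, 1.3.1 and 1.3.2 do not.

Why use a pandas.Series as the mask? scipy.sparse is built on numpy, not pandas.

I think the answer is to index with row_index.values. If you put that at the top of this answer, I will accept it. My point is that you provide a lot of useful information, but it is hard to tell which of the four code examples holds the answer.

In [179]: row_index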
Out[179]: 
0     True
1     True
2    False
3    False
4    False
Name: thing, dtype: bool

In [185]: (sdf.A)[row_index]                                                    
Out[185]: 
array([[1.  , 0.12],
       [1.  , 0.13]])
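
Since the behaviour reported in this thread varied between scipy releases, a more defensive variant is to turn the mask into integer row positions before indexing. The helper below is a hypothetical sketch (the name `select_rows` and its checks are assumptions, not part of scipy or of the original answer); it accepts a pandas Series, a list, or a numpy array as the mask.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def select_rows(matrix, mask):
    """Return the rows of a sparse matrix where mask is True (hypothetical helper)."""
    # Coerce Series/list/array to a plain numpy bool array.
    mask = np.asarray(mask, dtype=bool)
    if mask.ndim != 1 or mask.shape[0] != matrix.shape[0]:
        raise ValueError("mask must be 1-D with one entry per matrix row")
    # Integer positions of the True entries; row selection stays sparse.
    return matrix[np.flatnonzero(mask)]

df = pd.DataFrame({'thing': [1, 1, 2, 2, 2],
                   'score': [0.12, 0.13, 0.14, 0.15, 0.17]})
sdf = csr_matrix(df.values)

# The pandas Series is passed directly; the helper does the conversion.
result = select_rows(sdf, df['thing'] == 1)
print(result.shape)
```

The length check is there because a mask of the wrong size is an easy mistake to make after filtering or reindexing the DataFrame separately from the matrix.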