Python: How do I convert a large sparse matrix to an array (details below)?


I have a sparse feature matrix that is the result of the following sklearn operations:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None, stop_words=None, max_features=5000)

train_data_features = vectorizer.fit_transform(y)

Converting it to a contiguous array representation would materialize all the zeros in memory, and the resulting size would be:

train_data_features.shape[0] * train_data_features.shape[1] * train_data_features.dtype.itemsize / 1e6

This yields 6242.4 (in MB, since the byte count is divided by 1e6).

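That is more than 6 GB, compared with less than 1 MB for the original sparse representation. So how can I work around this so that I can efficiently fit the resulting array to a random forest classifier?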
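The size comparison above can be reproduced on a small scale. The following is a minimal sketch, assuming a synthetic matrix as a stand-in for `train_data_features` (the shape, density, and the name `X` are illustrative, not taken from the question). It applies the same dense-size formula and compares it with the bytes held by the three CSR arrays.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for train_data_features: 1,000 x 5,000 with 0.1% non-zero entries.
X = sparse.random(1000, 5000, density=0.001, format="csr", random_state=0)
X = (X > 0).astype(np.int64)  # turn the random values into integer "counts" of 1

# Size of the dense array in MB: rows * columns * bytes per element.
dense_mb = X.shape[0] * X.shape[1] * X.dtype.itemsize / 1e6

# Size of the CSR representation in MB: stored values plus the two index arrays.
sparse_mb = (X.data.nbytes + X.indices.nbytes + X.indptr.nbytes) / 1e6

print("dense: %.1f MB, sparse: %.3f MB" % (dense_mb, sparse_mb))
```

The gap grows with the matrix: the dense size depends only on the shape, while the sparse size depends on the number of stored non-zero entries.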

Try this:

import numpy as np

# Back the dense copy with a file on disk instead of allocating it all in RAM
m = np.memmap('train_data_features_dense.mmap', dtype=train_data_features.dtype, mode='w+', shape=train_data_features.shape)
train_data_features.todense(out=m)
# Work with m here as needed (reading, writing, etc.)
# When you are done with it, delete it; del flushes the buffers to disk automatically
del m
# To load the memmap in another script, reopen it with the same dtype and shape
m = np.memmap('train_data_features_dense.mmap', dtype=train_data_features.dtype, mode='r+', shape=train_data_features.shape)
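If it is unclear whether the memmap file really received the data, one option is to fill it block by block and then compare non-zero counts and column sums against the sparse matrix. The sketch below is one way to do that, not the answer's exact method. The small synthetic matrix `X`, the temporary file path, and the chunk size are all assumptions chosen for illustration.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Hypothetical small stand-in for train_data_features.
X = sparse.random(200, 50, density=0.05, format="csr", random_state=0)

# File-backed dense array with the same dtype and shape as the sparse matrix.
path = os.path.join(tempfile.mkdtemp(), "dense.mmap")
m = np.memmap(path, dtype=X.dtype, mode="w+", shape=X.shape)

# Densify a block of rows at a time so only one block is ever held in RAM.
chunk = 64  # rows per block; an arbitrary illustrative value
for start in range(0, X.shape[0], chunk):
    stop = min(start + chunk, X.shape[0])
    m[start:stop] = X[start:stop].toarray()
m.flush()

# Sanity checks: the file should hold exactly the stored non-zero entries.
nonzeros_on_disk = int(np.count_nonzero(m))
column_sums_match = np.allclose(m.sum(axis=0), np.asarray(X.sum(axis=0)).ravel())
print(nonzeros_on_disk, X.nnz, column_sums_match)
```

Printing a very large, mostly empty matrix shows only zeros in the truncated output, so aggregate checks like these are more informative than inspecting the array by eye.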

But as @yangjie said above, you should operate on the sparse matrix wherever possible.
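As an illustration of working on the sparse matrix directly, here is a minimal sketch that fits a `RandomForestClassifier` on a CSR matrix without ever densifying it. The synthetic data, the random binary `labels`, and the hyperparameter values are assumptions for the example and do not come from the question.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Hypothetical stand-ins for the feature matrix and its labels.
X = sparse.random(300, 500, density=0.02, format="csr", random_state=rng)
labels = rng.randint(0, 2, size=X.shape[0])

# The classifier accepts the sparse matrix as is; n_jobs=-1 uses all CPU cores.
clf = RandomForestClassifier(n_estimators=20, n_jobs=-1, random_state=0)
clf.fit(X, labels)

# Prediction also works on sparse input.
predictions = clf.predict(X[:5])
print(predictions.shape)
```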

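Most models in scikit-learn, including RandomForestClassifier, accept sparse matrices as input. You can fit directly on the sparse representation of the data.

Well, I did feed it to the random forest classifier, but it took so long to finish that I had to interrupt the Python kernel midway. Is there any way to make it faster?

@DiscoDancer, have you tried setting n_jobs=-1 on the random forest classifier? In your place, I would first fit on a small subset of the dataset to check that fit works at all with your configuration and input. Then, if you cannot wait for fit to finish even with n_jobs=-1, you could try compressing the feature space with feature hashing, PCA, or ICA. I think that would be much easier and much faster than working with a multi-gigabyte dense matrix.

@DiscoDancer, yes, import numpy as np.

It only gives a matrix of all zeros.

@DiscoDancer, after you passed it to the todense method? Maybe it is just that you cannot see all the values of such a large matrix? Try summing along one dimension, or counting all the non-zero entries.

Should I feed this m to the classifier?

@DiscoDancer, yes, you can use it just like an ordinary matrix. It will take up as much RAM as is available and load the data from the file into RAM slowly, but do not try to copy it into RAM.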
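The suggestions in the comments (try a small subset first, then shrink the feature space) might look roughly like the sketch below. It uses `TruncatedSVD` as the compression step because it accepts sparse input directly; that is a substitution for the PCA/ICA/feature hashing options named in the comment, not something the commenter specified. The synthetic data, the subset size, and the number of components are illustrative assumptions.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Hypothetical stand-ins for the feature matrix and its labels.
X = sparse.random(1000, 2000, density=0.01, format="csr", random_state=rng)
labels = rng.randint(0, 2, size=X.shape[0])

# Step 1: fit on a small random subset of rows to check that fit completes at all.
subset = rng.choice(X.shape[0], size=200, replace=False)
clf = RandomForestClassifier(n_estimators=10, n_jobs=-1, random_state=0)
clf.fit(X[subset], labels[subset])

# Step 2: compress the 2,000 sparse features to 50 dense components.
svd = TruncatedSVD(n_components=50, random_state=0)
X_small = svd.fit_transform(X)

# Step 3: fit on the much smaller dense matrix.
clf_small = RandomForestClassifier(n_estimators=10, n_jobs=-1, random_state=0)
clf_small.fit(X_small, labels)
print(X_small.shape)
```

The reduced matrix here has 50 columns instead of 2,000, so even its dense form stays small; whether accuracy survives the compression would need to be checked on real data.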