Python 3.x MemoryError: Unable to allocate GiB for an array with shape and data type float64 on a sparse matrix


python-3.x numpy


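I work with text data and have a document-term matrix represented as a scipy sparse matrix (for memory efficiency). I have built a class in which I train a topic model (the result of the topic model is the matrix prob_word_given_topic).

Currently, I am using the following code to post-analyse different models: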

colnames = ['Model', 'Coherence','SVD_values','Min_c0','Max_c0','Min_c1','Max_c1','Min_sv0','Max_sv0','Min_sv1','Max_sv1', 'PWGT']
analysis_two_factors = pd.DataFrame(columns=colnames)
directory = 'C:~/Images/'

#Experiment with: singular values, number of topics, weighting methods
for i, top in enumerate(range(3,28,2)):
    for weighting_method in [2,3,4,5,1]:
        print(type(top))
        one_round=[]
        model = FLSA(input_file = data_list, 
                 num_topics = top, 
                 num_words = 20, 
                 word_weighting =weighting_method, 
                 svd_factors=2, 
                 cluster_method='fcm')
        
        model.plot_svd_graph_2D(directory)
        model.plot_cluster_datapoints_graph(directory)
        one_round.append(model.setting)
        one_round.append(model.calc_coherence_value)
        one_round.append(model.s)
        one_round.append(min(model.cluster_centers[:,0]))
        one_round.append(max(model.cluster_centers[:,0]))
        one_round.append(min(model.cluster_centers[:,1]))
        one_round.append(max(model.cluster_centers[:,1]))
        one_round.append(min(model.svd_data[:,0]))
        one_round.append(max(model.svd_data[:,0]))
        one_round.append(min(model.svd_data[:,1]))
        one_round.append(max(model.svd_data[:,1]))
        one_round.append(model.prob_word_given_topic)
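        # Append a new row; indexing by i would overwrite the row for each weighting method
        analysis_two_factors.loc[len(analysis_two_factors)] = one_round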
        print('Finished iteration',str(i))

However, when I reach top=19, the following error suddenly occurs:

Traceback (most recent call last):

  File "<ipython-input-687-fe7cf1e4ea7a>", line 15, in <module>
    cluster_method='fcm')

  File "<ipython-input-672-e9c098fb0e45>", line 92, in __init__
    prob_word_given_doc = np.asarray(self.sparse_weighted_matrix / self.sparse_weighted_matrix.sum(1))

  File "c:~\continuum\anaconda3\lib\site-packages\scipy\sparse\base.py", line 620, in __truediv__
    return self._divide(other, true_divide=True)

  File "c:~\continuum\anaconda3\lib\site-packages\scipy\sparse\base.py", line 599, in _divide
    return np.true_divide(self.todense(), other)

MemoryError: Unable to allocate 2.87 GiB for an array with shape (4280, 90140) and data type float64

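I think it is the .todense step that requires too much memory.

Your traceback is cut off, but I'm guessing the problem is in the FLSA function. You don't identify the sparse matrix, but evidently this function scales by row sums, i.e. M / M.sum(1). That division is performed on a dense matrix, not on the original sparse one: as the traceback shows, scipy.sparse handles it by calling todense() first rather than doing an element-wise sparse division. If this error is new, it may be because the sparse matrix is larger than usual, or because something else is taking up too much memory. About 3 GB is what a dense float64 array of this shape (4280 x 90140) needs.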
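As a quick sanity check on that figure, here is a minimal sketch that recomputes the size of a dense float64 array with the shape reported in the error message. The variable names are illustrative only.

```python
import numpy as np

# Shape taken from the MemoryError message in the traceback.
rows, cols = 4280, 90140

# A dense float64 array stores 8 bytes per element, zeros included.
bytes_needed = rows * cols * np.dtype(np.float64).itemsize
gib = bytes_needed / 2**30

print(f"{gib:.2f} GiB")  # 2.87 GiB, matching the error message
```

So the allocation is exactly what densifying the whole document-term matrix costs, regardless of how few of its entries are nonzero.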

The line in the traceback that triggers the dense conversion is:

prob_word_given_doc = np.asarray(self.sparse_weighted_matrix / self.sparse_weighted_matrix.sum(1))
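One possible way to avoid the dense intermediate is to scale each row by multiplying with a diagonal matrix of reciprocal row sums, which keeps the result sparse. This is only a sketch: the FLSA class is not shown, so it assumes the later steps can accept a sparse matrix, and the names row_normalize and demo are made up for illustration.

```python
import numpy as np
from scipy import sparse


def row_normalize(m):
    """Divide each row of a sparse matrix by its row sum without densifying."""
    m = sparse.csr_matrix(m, dtype=np.float64)
    # m.sum(axis=1) returns a small (n_rows, 1) dense matrix; flatten it to 1-D.
    row_sums = np.asarray(m.sum(axis=1)).ravel()
    # Leave all-zero rows as zeros instead of dividing by zero.
    inv = np.zeros_like(row_sums)
    nonzero = row_sums != 0
    inv[nonzero] = 1.0 / row_sums[nonzero]
    # Left-multiplying by diag(inv) scales row i by inv[i]; the result stays sparse.
    return sparse.diags(inv) @ m


# Tiny hypothetical document-term matrix, including one empty document.
demo = sparse.csr_matrix(np.array([[1.0, 0.0, 3.0],
                                   [0.0, 0.0, 0.0],
                                   [2.0, 2.0, 0.0]]))
normalized = row_normalize(demo)
```

Note that wrapping the result in np.asarray or calling toarray() on it would bring back the same 2.87 GiB allocation, so this only helps if the code that consumes prob_word_given_doc works on the sparse form.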