MemoryError when performing cross-validation with scikit-learn in Python


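I am trying to run SVM classification on some DICOM images of size (91, 109, 91). There are 400 images in total, and each subject is either healthy (0) or has a positive scan (1).

I have written a simple piece of code that loops over all the DICOM files in a directory, gets the pixel counts, passes them to a numpy array and flattens it. For each DICOM file I look up the image status (0 or 1) in a csv file and add it to the numpy array. At the end of the loop I have a 2D numpy array in which each row holds the pixel counts and status of one patient.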

import os
import numpy as np
import dicom
from dicom.filereader import InvalidDicomError  # needed by the except clause below
import sklearn
import matplotlib.pyplot as plt
from sklearn import decomposition
from sklearn import cross_validation
from sklearn import svm
import csv
import re


dirName = '/home/nm/MachineLearning/DaTSCAN/PPMI/'
# Name of csv file
results_dat='PPMIdatabase.csv'
results_path=os.path.join("/",dirName,results_dat)

# make an empty array that we will populate with dicom image array values
data = []

for filename in os.listdir(dirName):
    dicom_file = os.path.join("/",dirName,filename)  

    if os.path.isfile(dicom_file) and filename.endswith(".dcm"):
        try:
            # check for study in csv file to get diagnosis
            #Get study number from dicom string
            study_id = int(re.search(r'\d+', filename).group())            
            with open(results_path, 'r') as file:
                reader = csv.reader(file)
                search_group = [line[1] for line in reader if line[0] == str(study_id)]
                group = str(search_group[0])
                #HC 0 # PD 1
                if group == 'HC':
                    group_id = 0
                else:
                    group_id = 1
            ds = dicom.read_file(dicom_file)
            img = ds.pixel_array
            a = np.reshape(img,[img.size,1],'C')
            # Add group_id to a
            a = np.insert(a,0,group_id)
            data.append(a)
        except InvalidDicomError:
            print("File %s cannot be opened by dicom.read_file" %(filename))

#make python list to numpy array
full_data = np.array(data)

# Want to predict Y
Y = full_data[:,0] # first column of array is classification status 0 or 1
# Image data
X = full_data[:, 1:]

I then want to run cross-validation to evaluate the estimator's performance (using scikit-learn).
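For reference, the split call that appears in the traceback quoted in this question can be reproduced on a small synthetic array. This is only a sketch: the random data and the 50-column width are placeholders rather than the real scans, and it imports train_test_split from sklearn.model_selection, the module that replaced sklearn.cross_validation in later scikit-learn releases.

```python
import numpy as np
from sklearn.model_selection import train_test_split  # successor of sklearn.cross_validation

rng = np.random.RandomState(0)

# Synthetic stand-in for the real data: 400 "scans" with 50 "voxels" each
# (the real flattened images have 902629 values per row).
X = rng.randint(0, 1000, size=(400, 50)).astype(np.int16)
Y = rng.randint(0, 2, size=400)

# Same arguments as the call shown in the traceback.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.4, random_state=0
)

print(X_train.shape, X_test.shape)  # (240, 50) (160, 50)
```

With test_size=0.4 and 400 rows, 240 rows go to the training set and 160 to the test set.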

However, I get the following error, which points to a memory problem:

Traceback (most recent call last):
  File "cross_validation.py", line 56, in <module>
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,Y, test_size=0.4,random_state=0)
  File "/home/nm/.local/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1919, in train_test_split
    safe_indexing(a, test)) for a in arrays))
  File "/home/nm/.local/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1919, in <genexpr>
    safe_indexing(a, test)) for a in arrays))
  File "/home/nm/.local/lib/python2.7/site-packages/sklearn/utils/__init__.py", line 163, in safe_indexing
    return X.take(indices, axis=0)
MemoryError
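
The last frame of the traceback, X.take(indices, axis=0), hints at where the memory goes: ndarray.take returns a new array rather than a view, so a train/test split needs room for both subsets in addition to the original X. A minimal sketch on a toy array (the array contents and the index values are made up for illustration):

```python
import numpy as np

# Toy stand-in for X: 5 rows of 4 int16 values.
X = np.arange(20, dtype=np.int16).reshape(5, 4)

# Hypothetical row indices for the two subsets.
train_idx = np.array([0, 2, 4])
test_idx = np.array([1, 3])

# The same operation as in the traceback: select rows along axis 0.
X_train = X.take(train_idx, axis=0)
X_test = X.take(test_idx, axis=0)

# Neither subset shares a buffer with X, so both are fresh allocations.
copies_are_independent = not (
    np.shares_memory(X, X_train) or np.shares_memory(X, X_test)
)

# Together the two copies need as many bytes again as X itself.
extra_bytes = X_train.nbytes + X_test.nbytes
print(copies_are_independent, X.nbytes, extra_bytes)
```

In other words, while the split is running the process holds roughly twice the size of X.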

It seems my numpy array is too large. How can I solve this?

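Can you confirm what full_data.shape and full_data.dtype are? How much RAM do you have?

@ali_m The dtype is int16, I have 8 GB of RAM, and the shape is (400, 902630).

Given those dimensions and that dtype, X would be 722 MB, while X_train and X_test would take up 433 MB and 288 MB respectively (the Ys are tiny by comparison). I see no reason why you would hit a MemoryError here, unless you had less than ~722 MB of free memory when you called train_test_split or your description of the array is inaccurate.

Well, there you go! No matter how much RAM is installed, a 32-bit Python process cannot address more than about 2 GB of memory (the exact ceiling depends on the operating system, but it is always below 4 GB). The most likely explanation for the MemoryError you are seeing is that you are hitting this limit. You should switch to a 64-bit build of Python (and to a 64-bit operating system, if you are running 32-bit Xubuntu).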
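
Both points above can be checked with a few lines of standard-library Python. The sketch below redoes the size arithmetic from the comments and reports whether the running interpreter is a 32-bit or 64-bit build. The helper name megabytes is made up for illustration, and 1 MB is taken as 10^6 bytes, which is the convention that reproduces the 722 MB figure.

```python
import struct
import sys

n_images = 400
n_voxels = 91 * 109 * 91  # 902629 values per flattened image
itemsize = 2              # bytes per int16 value


def megabytes(n_rows, n_cols, bytes_per_value):
    """Size of a dense n_rows x n_cols array in MB (1 MB = 10**6 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1e6


x_mb = megabytes(n_images, n_voxels, itemsize)  # full X
train_mb = megabytes(240, n_voxels, itemsize)   # 60% of the rows
test_mb = megabytes(160, n_voxels, itemsize)    # 40% of the rows

# Pointer size in bits: 32 on a 32-bit interpreter, 64 on a 64-bit one.
pointer_bits = struct.calcsize("P") * 8
is_64bit = sys.maxsize > 2**32

print("X: %.1f MB, X_train: %.1f MB, X_test: %.1f MB" % (x_mb, train_mb, test_mb))
print("Python build: %d-bit" % pointer_bits)
```

If the second line reports a 32-bit build, the process can run out of address space long before the machine's 8 GB of RAM is exhausted.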