Python: MemoryError when performing cross-validation with scikit-learn
I am trying to run SVM classification on some DICOM images of size (91, 109, 91). There are 400 images in total, and each subject is either healthy (0) or has a positive scan (1).

I have written a simple piece of code that loops over all the DICOM files in a directory, gets the pixel counts, passes them to a numpy array and flattens that array. For each DICOM file, I look up the image status (0 or 1) in a csv file and add it to the numpy array. At the end of this loop I have a 2D numpy array in which each row holds one patient's status followed by the pixel counts.
import os
import numpy as np
import dicom
# InvalidDicomError must be imported explicitly for the except clause below
from dicom.filereader import InvalidDicomError
import sklearn
import matplotlib.pyplot as plt
from sklearn import decomposition
from sklearn import cross_validation
from sklearn import svm
import csv
import re

dirName = '/home/nm/MachineLearning/DaTSCAN/PPMI/'
# Name of csv file
results_dat = 'PPMIdatabase.csv'
results_path = os.path.join(dirName, results_dat)

# make an empty list that we will populate with dicom image array values
data = []
for filename in os.listdir(dirName):
    dicom_file = os.path.join(dirName, filename)
    if os.path.isfile(dicom_file) and filename.endswith(".dcm"):
        try:
            # check for study in csv file to get diagnosis
            # Get study number from dicom string
            study_id = int(re.search(r'\d+', filename).group())
            with open(results_path, 'r') as file:
                reader = csv.reader(file)
                search_group = [line[1] for line in reader if line[0] == str(study_id)]
            group = str(search_group[0])
            # HC 0 # PD 1
            if group == 'HC':
                group_id = 0
            else:
                group_id = 1
            ds = dicom.read_file(dicom_file)
            img = ds.pixel_array
            a = np.reshape(img, [img.size, 1], 'C')
            # Add group_id to a (np.insert without an axis flattens a to 1-D)
            a = np.insert(a, 0, group_id)
            data.append(a)
        except InvalidDicomError:
            print("File %s cannot be opened by dicom.read_file" % (filename))

# make python list to numpy array
full_data = np.array(data)
# Want to predict Y
Y = full_data[:, 0]  # first column of array is classification status 0 or 1
# Image data
X = full_data[:, 1:]
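A side note on the loading loop above: collecting every flattened image in a Python list and then calling np.array on it means the list and the new array both sit in memory for a moment, so peak usage is roughly twice the size of the final array. One possible way to avoid that, sketched here under assumptions, is to preallocate the 2D array once and fill it row by row. The function name stack_into_preallocated, its parameters, and the tiny synthetic volumes are all hypothetical stand-ins, not part of the original code.

```python
import numpy as np


def stack_into_preallocated(samples, n_samples, n_features, dtype=np.int16):
    """Fill one preallocated (n_samples, n_features + 1) array row by row.

    `samples` yields (label, volume) pairs. Column 0 holds the label and
    the remaining columns hold the flattened volume, which mirrors the
    layout of full_data in the question.
    """
    out = np.empty((n_samples, n_features + 1), dtype=dtype)
    for row, (label, volume) in enumerate(samples):
        out[row, 0] = label
        out[row, 1:] = np.ravel(volume)  # C-order flatten, as in the question
    return out


# Tiny synthetic stand-ins for the DICOM volumes (hypothetical data).
demo_shape = (3, 4, 2)
demo_samples = [(i % 2, np.full(demo_shape, i, dtype=np.int16)) for i in range(5)]

full_data = stack_into_preallocated(demo_samples, len(demo_samples), 3 * 4 * 2)
Y = full_data[:, 0]   # labels
X = full_data[:, 1:]  # image data (a view, no extra copy)
```

With the real data, n_samples would be the number of readable .dcm files and n_features would be 91 * 109 * 91; both would need to be known before the loop starts.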
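I then want to run cross-validation to evaluate the estimator's performance (using scikit-learn).

However, I get the following error, which points to a memory problem: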
Traceback (most recent call last):
File "cross_validation.py", line 56, in <module>
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,Y, test_size=0.4,random_state=0)
File "/home/nm/.local/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1919, in train_test_split
safe_indexing(a, test)) for a in arrays))
File "/home/nm/.local/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1919, in <genexpr>
safe_indexing(a, test)) for a in arrays))
File "/home/nm/.local/lib/python2.7/site-packages/sklearn/utils/__init__.py", line 163, in safe_indexing
return X.take(indices, axis=0)
MemoryError
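My numpy array is too large. How can I solve this?

Can you confirm what full_data.shape and full_data.dtype are? How much RAM do you have?

@ali_m The dtype is int16, I have 8 GB of RAM, and the shape is (400, 902630).

Given those dimensions and that dtype, X would be 722 MB, while X_train and X_test would take up 433 MB and 288 MB respectively (the Y arrays are tiny by comparison). I see no reason why you would hit a MemoryError here, unless you have less than ~722 MB of free memory when train_test_split is called, or your description of the array is inaccurate.

There you go! A 32-bit Python process cannot address more than 2 GB of memory, no matter how much RAM is installed. The most likely explanation for the MemoryError you are seeing is that you have hit this limit. You should switch to a 64-bit build of Python (and a 64-bit operating system, if you are running 32-bit Xubuntu).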
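The size estimates and the 32-bit diagnosis above can both be checked with a few lines of standard-library Python. The following is a minimal sketch: the helper array_megabytes is a hypothetical name, the shape and dtype are the ones reported in the comments, and the 240/160 row split is what test_size=0.4 gives for 400 samples.

```python
import struct
import sys


def array_megabytes(n_rows, n_cols, itemsize):
    """Size of a dense n_rows x n_cols array in megabytes (10**6 bytes)."""
    return n_rows * n_cols * itemsize / 1e6


# Is this interpreter a 32-bit or a 64-bit build?
pointer_bits = struct.calcsize("P") * 8
is_64bit = sys.maxsize > 2**32

# Figures reported in the comments: shape (400, 902630), dtype int16 (2 bytes).
full_mb = array_megabytes(400, 902630, 2)
# X drops the label column; test_size=0.4 leaves 240 training rows and 160 test rows.
train_mb = array_megabytes(240, 902629, 2)
test_mb = array_megabytes(160, 902629, 2)

print("Python build: %d-bit" % pointer_bits)
print("full_data: %.0f MB, X_train: %.0f MB, X_test: %.0f MB"
      % (full_mb, train_mb, test_mb))
```

If the first line reports a 32-bit build, the process is subject to the address-space limit described above regardless of the 8 GB of installed RAM.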