TensorFlow: How do I resolve "ran out of memory" when running Keras on an HPC cluster (Argon)?

Tags: tensorflow, keras, gpu, cluster-computing, conv-neural-network

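I have a ConvLSTM neural network written in Keras. I submitted the same code to two queues on the cluster, one GPU queue and one CPU queue. The code runs on the CPU queue, but on the GPU queue I get an error. I have copied one line of the error file below: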

"W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.12MiB. Current allocation summary follows."

Error file:

Using TensorFlow backend.
2018-04-05 17:39:59.059431: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-04-05 17:40:00.220946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:81:00.0
totalMemory: 15.90GiB freeMemory: 332.94MiB
2018-04-05 17:40:00.221266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:81:00.0, compute capability: 6.0)
/opt/apps/python/2.7.14_openmpi-2.1.2_parallel_studio-2017.4/lib/python2.7/site-packages/sklearn/utils/validation.py:475: DataConversionWarning: Data with input dtype uint8 was converted to float64 by MinMaxScaler.
  warnings.warn(msg, DataConversionWarning)
2018-04-05 17:40:50.577736: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.12MiB.  Current allocation summary follows.
2018-04-05 17:40:50.578144: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (256):   Total Chunks: 296, Chunks in use: 294. 74.0KiB allocated for chunks. 73.5KiB in use in bin. 9.3KiB client-requested in use in bin.
2018-04-05 17:40:50.578167: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (512):   Total Chunks: 39, Chunks in use: 39. 22.0KiB allocated for chunks. 22.0KiB in use in bin. 16.1KiB client-requested in use in bin.
2018-04-05 17:40:50.578179: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (1024):  Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2018-04-05 17:40:50.578192: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (2048):  Total Chunks: 14, Chunks in use: 14. 36.8KiB allocated for chunks. 36.8KiB in use in bin. 34.5KiB client-requested in use in bin.
2018-04-05 17:40:50.578203: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (4096):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578216: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (8192):  Total Chunks: 62, Chunks in use: 61. 882.2KiB allocated for chunks. 869.2KiB in use in bin. 857.8KiB client-requested in use in bin.
2018-04-05 17:40:50.578228: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (16384):     Total Chunks: 13, Chunks in use: 12. 223.0KiB allocated for chunks. 198.8KiB in use in bin. 190.1KiB client-requested in use in bin.
2018-04-05 17:40:50.578239: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (32768):     Total Chunks: 46, Chunks in use: 46. 2.53MiB allocated for chunks. 2.53MiB in use in bin. 2.53MiB client-requested in use in bin.
2018-04-05 17:40:50.578251: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (65536):     Total Chunks: 168, Chunks in use: 168. 13.19MiB allocated for chunks. 13.19MiB in use in bin. 13.10MiB client-requested in use in bin.
2018-04-05 17:40:50.578263: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (131072):    Total Chunks: 1, Chunks in use: 1. 135.8KiB allocated for chunks. 135.8KiB in use in bin. 80.0KiB client-requested in use in bin.
2018-04-05 17:40:50.578276: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (262144):    Total Chunks: 243, Chunks in use: 243. 76.74MiB allocated for chunks. 76.74MiB in use in bin. 75.94MiB client-requested in use in bin.
2018-04-05 17:40:50.578287: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (524288):    Total Chunks: 3, Chunks in use: 3. 1.64MiB allocated for chunks. 1.64MiB in use in bin. 960.0KiB client-requested in use in bin.
2018-04-05 17:40:50.578297: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (1048576):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578309: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (2097152):   Total Chunks: 4, Chunks in use: 4. 12.50MiB allocated for chunks. 12.50MiB in use in bin. 12.50MiB client-requested in use in bin.
2018-04-05 17:40:50.578336: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (4194304):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578348: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (8388608):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578358: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (16777216):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578367: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (33554432):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578376: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (67108864):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578386: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (134217728):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578395: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (268435456):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-04-05 17:40:50.578406: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin for 3.12MiB was 2.00MiB, Chunk State: 
2018-04-05 17:40:50.578417: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c000000 of size 1280
2018-04-05 17:40:50.578426: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c000500 of size 256
2018-04-05 17:40:50.578433: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c000600 of size 256
2018-04-05 17:40:50.578440: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c000700 of size 57600
2018-04-05 17:40:50.578448: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00e800 of size 512
2018-04-05 17:40:50.578456: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00ea00 of size 768
2018-04-05 17:40:50.578464: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00ed00 of size 256
2018-04-05 17:40:50.578471: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00ee00 of size 256
2018-04-05 17:40:50.578478: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00ef00 of size 256
2018-04-05 17:40:50.578485: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f000 of size 256
2018-04-05 17:40:50.578493: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f100 of size 256
2018-04-05 17:40:50.578500: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f200 of size 256
2018-04-05 17:40:50.578507: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f300 of size 256
2018-04-05 17:40:50.578514: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f400 of size 256
2018-04-05 17:40:50.578522: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f500 of size 256
2018-04-05 17:40:50.578529: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c00f600 of size 57600
2018-04-05 17:40:50.578536: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c01d700 of size 512
2018-04-05 17:40:50.578544: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c01d900 of size 3072
2018-04-05 17:40:50.578551: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c01e500 of size 57600
2018-04-05 17:40:50.578559: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02c600 of size 512
2018-04-05 17:40:50.578571: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02c800 of size 768
2018-04-05 17:40:50.578579: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02cb00 of size 256
2018-04-05 17:40:50.578586: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02cc00 of size 256
2018-04-05 17:40:50.578593: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02cd00 of size 256
2018-04-05 17:40:50.578600: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02ce00 of size 256
2018-04-05 17:40:50.578607: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02cf00 of size 256
2018-04-05 17:40:50.578614: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02d000 of size 256
2018-04-05 17:40:50.578622: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02d100 of size 256
2018-04-05 17:40:50.578629: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c02d200 of size 14592
2018-04-05 17:40:50.578637: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c030b00 of size 256
2018-04-05 17:40:50.578644: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c030c00 of size 256
2018-04-05 17:40:50.578652: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c030d00 of size 256
2018-04-05 17:40:50.578659: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c030e00 of size 256
2018-04-05 17:40:50.578666: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c030f00 of size 256
2018-04-05 17:40:50.578673: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c031000 of size 256
2018-04-05 17:40:50.578681: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c031100 of size 256
2018-04-05 17:40:50.578688: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c031200 of size 256
2018-04-05 17:40:50.578695: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c031300 of size 512
2018-04-05 17:40:50.578702: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c031500 of size 14592
2018-04-05 17:40:50.578709: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c034e00 of size 256
2018-04-05 17:40:50.578717: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c034f00 of size 256
2018-04-05 17:40:50.578724: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035000 of size 256
2018-04-05 17:40:50.578731: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035100 of size 256
2018-04-05 17:40:50.578738: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035200 of size 256
2018-04-05 17:40:50.578746: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035300 of size 256
2018-04-05 17:40:50.578753: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035400 of size 256
2018-04-05 17:40:50.578760: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035500 of size 256
2018-04-05 17:40:50.578767: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035600 of size 512
2018-04-05 17:40:50.578775: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c035800 of size 23296
2018-04-05 17:40:50.578782: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c03b300 of size 57600
2018-04-05 17:40:50.578789: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c049400 of size 512
2018-04-05 17:40:50.578797: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c049600 of size 57600
2018-04-05 17:40:50.578804: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c057700 of size 57600
2018-04-05 17:40:50.578811: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065800 of size 256
2018-04-05 17:40:50.578823: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065900 of size 256
2018-04-05 17:40:50.578830: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065a00 of size 256
2018-04-05 17:40:50.578838: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065b00 of size 256
2018-04-05 17:40:50.578845: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065c00 of size 256
2018-04-05 17:40:50.578852: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065d00 of size 256
2018-04-05 17:40:50.578859: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065e00 of size 256
2018-04-05 17:40:50.578867: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c065f00 of size 256
2018-04-05 17:40:50.578874: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c066000 of size 512
2018-04-05 17:40:50.578881: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c066200 of size 14592
2018-04-05 17:40:50.578888: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c069b00 of size 256
2018-04-05 17:40:50.578896: I tensorflow/core/common_runtime/bfc_allocator.cc:661] Chunk at 0x2b373c069c00 of size 256

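TensorFlow on the CPU loads the data into host memory (RAM), whereas TensorFlow on the GPU has to load it into GPU memory. That is most likely the cause of your error. You can try reducing the batch size.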
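To see why a smaller batch helps, here is a rough sketch of how the size of one input batch scales with the batch size. The function name and the example shape (10 time steps of 64x64 single-channel frames) are hypothetical and not taken from the question. The estimate covers only the input tensor; activations, gradients, and weights need additional GPU memory on top of it.

```python
def batch_bytes(batch_size, time_steps, height, width, channels, bytes_per_value=4):
    """Bytes needed for one 5D ConvLSTM input batch.

    bytes_per_value defaults to 4 (float32); use 8 for float64.
    """
    return batch_size * time_steps * height * width * channels * bytes_per_value


# Hypothetical input shape, chosen only for illustration.
full = batch_bytes(32, 10, 64, 64, 1)
half = batch_bytes(16, 10, 64, 64, 1)

print("batch of 32: {:.2f} MiB".format(full / 1024.0 ** 2))
print("batch of 16: {:.2f} MiB".format(half / 1024.0 ** 2))
```

The footprint is linear in the batch size, so halving the batch halves the memory needed for the batch and, roughly, for the per-sample activations that depend on it.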

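Can you check how much memory is in use on the GPU before you hit the error?

I edited the post and added the first part of the error file. I am not sure whether it contains the information you asked for. If it does not, how can I get it?

OK, I don't see anything useful in there. Without knowing more about the setup it is hard to give advice, other than that the GPU does not have enough memory for this task. If you cannot reduce the batch size any further, the model may simply be too large for your GPU. Do you know how much GPU memory the card has?

I am running this on a cluster owned by my university, and I am trying to find out. But the model is not huge. It has 4 ConvLSTM layers with 20 filters each.

Just open a tf session and keep it in a variable. Keras will attach to it automatically.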
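The error file already contains a device line, `totalMemory: 15.90GiB freeMemory: 332.94MiB`, which speaks to the question about GPU memory. Below is a minimal sketch, assuming that TensorFlow 1.x log format, which parses the line and computes the free share. It also shows one possible reading of the last comment: create a session yourself and pass it to Keras explicitly. The helper names `parse_gpu_memory` and `make_keras_session` are hypothetical. The session helper assumes TensorFlow 1.x with standalone Keras, as the 2018 log suggests, and it cannot release memory that another process already holds.

```python
import re

# Byte multipliers for the units that appear in the TensorFlow device line.
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3}
_MEMORY_FIELD = re.compile(r"(totalMemory|freeMemory):\s*([\d.]+)(GiB|MiB|KiB|B)")


def parse_gpu_memory(line):
    """Return {'totalMemory': bytes, 'freeMemory': bytes} from a device log line."""
    return {
        name: float(value) * _UNITS[unit]
        for name, value, unit in _MEMORY_FIELD.findall(line)
    }


def make_keras_session(memory_fraction=None):
    """Create a TF 1.x session and register it with Keras.

    Assumes TensorFlow 1.x and standalone Keras. The imports are local so
    that the parsing helper above works without TensorFlow installed.
    """
    import tensorflow as tf
    from keras import backend as K

    config = tf.ConfigProto()
    # Allocate GPU memory on demand instead of reserving it all up front.
    config.gpu_options.allow_growth = True
    if memory_fraction is not None:
        # Optional cap on the share of GPU memory this process may use.
        config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    session = tf.Session(config=config)
    K.set_session(session)
    return session


# Line copied from the error file in the question.
memory = parse_gpu_memory("totalMemory: 15.90GiB freeMemory: 332.94MiB")
free_fraction = memory["freeMemory"] / memory["totalMemory"]
print("free share of GPU memory: {:.1%}".format(free_fraction))
```

For the logged values the free share comes out at about 2 percent of the card, measured before the model is built. That may mean another job was already occupying the GPU, although nothing in the thread confirms this. Running `nvidia-smi` on the compute node would show which processes hold the memory.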