Python: Model fits on a single GPU, but script crashes when trying to fit on multiple GPUs

I have a model that trains fine on a single GPU, but when I try to fit it with a multi-GPU model, I get this CUDA error just before the script exits:

F tensorflow/stream_executor/cuda/cuda_dnn.cc:521] Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 0 feature_map_count: 16 spatial: 128 128 128  value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}
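
I tried passing both a compiled and an uncompiled version of the model instance to the multi_gpu_model function, but it made no difference. I call it like this: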

multi_model = multi_gpu_model(model, gpus=4)

Compilation is done like this and does not produce any error:

multi_model.compile(
    optimizer=keras.optimizers.Adam(5e-4),
    loss=dice_coefficient_loss,
    metrics=[dice_coefficient]
            + get_label_wise_dice_coefficient_functions(n_labels))

import functools

from keras import backend as K


def dice_coefficient(y_true, y_pred, smooth=1.):
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return ((2. * intersection + smooth)
            / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth))


def dice_coefficient_loss(y_true, y_pred):
    return -dice_coefficient(y_true, y_pred)


def label_wise_dice_coefficient(y_true, y_pred, label_index):
    return dice_coefficient(y_true[:, label_index], y_pred[:, label_index])


def get_label_dice_coefficient_function(label_index):
    f = functools.partial(label_wise_dice_coefficient, label_index=label_index)
    f.__setattr__('__name__', 'label_{0}_dice_coef'.format(label_index))
    return f


def get_label_wise_dice_coefficient_functions(n_labels):
    return [get_label_dice_coefficient_function(i) for i in range(n_labels)]

(These functions, and most of the model architecture, were taken from a GitHub repository.)

I am using Python 3.6.6, tensorflow-gpu 1.10.0, cudatoolkit 9.2 and cudnn 7.2.1 from the conda main repo, and keras-contrib 2.0.8 installed with pip/git, on 64-bit CentOS 7.4.1708.

Looking at the earlier log lines, the multiple GPUs appear to be detected correctly:

2018-10-09 16:30:19.977993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:20:00.0
totalMemory: 10.92GiB freeMemory: 10.74GiB
2018-10-09 16:30:20.318137: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:21:00.0
totalMemory: 10.92GiB freeMemory: 10.74GiB
2018-10-09 16:30:20.595428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:22:00.0
totalMemory: 10.92GiB freeMemory: 10.74GiB
2018-10-09 16:30:20.953619: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:23:00.0
totalMemory: 10.92GiB freeMemory: 10.74GiB
2018-10-09 16:30:20.967429: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2018-10-09 16:30:22.415906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-09 16:30:22.415957: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1 2 3
2018-10-09 16:30:22.415965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N Y Y Y
2018-10-09 16:30:22.415971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   Y N Y Y
2018-10-09 16:30:22.415982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2:   Y Y N Y
2018-10-09 16:30:22.415988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3:   Y Y Y N
2018-10-09 16:30:22.416681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10393 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:20:00.0, compute capability: 6.1)
2018-10-09 16:30:22.536003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10393 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:21:00.0, compute capability: 6.1)
2018-10-09 16:30:22.637811: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10393 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:22:00.0, compute capability: 6.1)
2018-10-09 16:30:22.747698: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10393 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:23:00.0, compute capability: 6.1)
2018-10-09 16:30:25,557.557:__main__:INFO:Compiling model
2018-10-09 16:30:25,634.634:__main__:INFO:Fitting model
2018-10-09 16:31:31.773355: F tensorflow/stream_executor/cuda/cuda_dnn.cc:521] Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 0 feature_map_count: 16 spatial: 128 128 128  value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}
/bin/bash: line 1: 160691 Aborted

Any help on what I am doing wrong would be greatly appreciated.

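It turns out that the .fit() method of the multi-GPU model does not like it when the number of samples in the dataset is not a multiple of the batch size, which in my case equals the number of GPUs. Removing a single sample from my dataset fixed my problem. I reported the bug.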
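
To illustrate why a single leftover sample can matter, here is a minimal sketch in plain NumPy. It assumes the multi-GPU wrapper divides each batch across GPUs by integer division, with the last GPU taking the remainder; that is an assumption about multi_gpu_model, not something confirmed in this thread. The helper names (slice_sizes, trim_to_multiple) and the sample count of 41 are hypothetical.

```python
import numpy as np


def slice_sizes(batch_len, n_gpus):
    """Per-GPU sub-batch sizes under an integer-division split (assumed scheme)."""
    step = batch_len // n_gpus
    # Every GPU but the last gets `step` samples; the last one takes the rest.
    return [step] * (n_gpus - 1) + [batch_len - step * (n_gpus - 1)]


def last_batch_len(n_samples, batch_size):
    """Length of the final batch of an epoch."""
    remainder = n_samples % batch_size
    return remainder if remainder else batch_size


def trim_to_multiple(x, y, multiple):
    """Drop trailing samples so the sample count is a multiple of `multiple`."""
    n_keep = (len(x) // multiple) * multiple
    return x[:n_keep], y[:n_keep]


n_gpus = 4
batch_size = 4  # in the question, the batch size equals the number of GPUs

# Hypothetical dataset with one sample too many.
x = np.zeros((41, 2))
y = np.zeros((41, 1))

# Before trimming: the final batch holds one sample, so some GPUs get none.
print(slice_sizes(last_batch_len(len(x), batch_size), n_gpus))

# After trimming: every batch is full and every GPU gets at least one sample.
x_trim, y_trim = trim_to_multiple(x, y, batch_size)
print(len(x_trim), slice_sizes(last_batch_len(len(x_trim), batch_size), n_gpus))
```

Under this assumed split, an empty sub-batch would be consistent with the count: 0 field in the batch_descriptor of the error message. Trimming drops data, so a generator that pads or skips the last incomplete batch may be preferable when every sample matters.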

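Have you defined the model under a CPU scope?

@nicoco According to the Keras documentation, you can create the base model under a CPU scope and then call the multi_gpu utility for training. Since you have a custom loss and custom metrics, I wonder whether you are placing the base model on the CPU or on a GPU. I am not very sure about this, though.

I looked at the GitHub link you shared and saw that a lot of callbacks are added during training. There are some known issues with callbacks when training a multi-GPU model. Are you using callbacks?

@nicoco Maybe! The message shows that the check fails when the batch descriptor struct is converted into a cuDNN tensor. So, given the spatial size of 128 and the feature map count, you could try to work out which layer this tensor comes from, which might help narrow down the problem.

@nicoco If you have InstanceNormalization layers before this layer is reached and they are handled fine, then you can be fairly confident that they are not the problem. Otherwise, you could take the normalization out and run the training again. That way we can eliminate some possibilities.

Interesting! A quick Google search turned up a relevant post. But yes, if one GPU does not receive any batch of data, it would be nice to get a proper error for that!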
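
The suggestion in the comments to trace the failing tensor back to a layer can be sketched as follows. This is only an illustration: it assumes channels-first output shapes of the form (batch, feature_maps, *spatial), matching the BatchDepthYX layout in the error, and it relies on each layer exposing name and output_shape attributes as Keras layers usually do. The function name find_layers_by_shape and the stand-in layers are hypothetical.

```python
from types import SimpleNamespace


def find_layers_by_shape(layers, feature_maps, spatial):
    """Names of layers whose output shape is (batch, feature_maps, *spatial)."""
    wanted = (feature_maps,) + tuple(spatial)
    matches = []
    for layer in layers:
        shape = getattr(layer, "output_shape", None)
        # Skip layers without a single tuple shape (e.g. multi-output layers).
        if not isinstance(shape, tuple):
            continue
        if tuple(shape[1:]) == wanted:
            matches.append(layer.name)
    return matches


# Stand-in objects so the sketch runs without Keras installed.
# With a real model, pass model.layers instead.
demo_layers = [
    SimpleNamespace(name="input", output_shape=(None, 4, 128, 128, 128)),
    SimpleNamespace(name="conv_a", output_shape=(None, 16, 128, 128, 128)),
    SimpleNamespace(name="conv_b", output_shape=(None, 32, 64, 64, 64)),
]

# Values taken from the error: feature_map_count 16, spatial 128 128 128.
print(find_layers_by_shape(demo_layers, 16, (128, 128, 128)))
```

Several layers may share the same output shape, so the result is a list of candidates rather than a single culprit.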