Python: Why does Keras training slow down after a while?


I am running into an issue where my model's training speed slows down dramatically.

Here is what happens:


Epoch 00001: val_loss did not improve from 0.03340
Run 27 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 2s 156us/step - loss: 0.0420 - binary_accuracy: 0.9459 - accuracy: 0.9848 - val_loss: 0.0362 - val_binary_accuracy: 0.9501 - val_accuracy: 0.9876

Epoch 00001: val_loss did not improve from 0.03340
Run 28 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 2s 150us/step - loss: 0.0422 - binary_accuracy: 0.9431 - accuracy: 0.9851 - val_loss: 0.0395 - val_binary_accuracy: 0.9418 - val_accuracy: 0.9863

Epoch 00001: val_loss did not improve from 0.03340
Run 29 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 6s 474us/step - loss: 0.0454 - binary_accuracy: 0.9479 - accuracy: 0.9833 - val_loss: 0.0395 - val_binary_accuracy: 0.9475 - val_accuracy: 0.9856

Epoch 00001: val_loss did not improve from 0.03340
Run 30 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 701us/step - loss: 0.0462 - binary_accuracy: 0.9406 - accuracy: 0.9830 - val_loss: 0.0339 - val_binary_accuracy: 0.9502 - val_accuracy: 0.9882

Epoch 00001: val_loss did not improve from 0.03340
Run 31 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 646us/step - loss: 0.0457 - binary_accuracy: 0.9462 - accuracy: 0.9836 - val_loss: 0.0375 - val_binary_accuracy: 0.9417 - val_accuracy: 0.9861

Epoch 00001: val_loss did not improve from 0.03340
Run 32 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 640us/step - loss: 0.0471 - binary_accuracy: 0.9313 - accuracy: 0.9827 - val_loss: 0.0373 - val_binary_accuracy: 0.9446 - val_accuracy: 0.9868

Epoch 00001: val_loss did not improve from 0.03340
Run 33 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 669us/step - loss: 0.0423 - binary_accuracy: 0.9458 - accuracy: 0.9852 - val_loss: 0.0356 - val_binary_accuracy: 0.9510 - val_accuracy: 0.9873

Epoch 00001: val_loss did not improve from 0.03340
Run 34 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 648us/step - loss: 0.0441 - binary_accuracy: 0.9419 - accuracy: 0.9841 - val_loss: 0.0407 - val_binary_accuracy: 0.9357 - val_accuracy: 0.9849

Epoch 00001: val_loss did not improve from 0.03340
Run 35 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 9s 713us/step - loss: 0.0460 - binary_accuracy: 0.9473 - accuracy: 0.9829 - val_loss: 0.0423 - val_binary_accuracy: 0.9604 - val_accuracy: 0.9840

Epoch 00001: val_loss did not improve from 0.03340
Run 36 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 7s 557us/step - loss: 0.0508 - binary_accuracy: 0.9530 - accuracy: 0.9810 - val_loss: 0.0470 - val_binary_accuracy: 0.9323 - val_accuracy: 0.9820
My GPU utilization does not decrease (it actually increases).

My CPU usage, CPU clock, and GPU clocks (core and memory) all stay constant. My RAM usage also stays roughly the same.

The only odd thing is that my total power draw (as a percentage) decreases.

I read somewhere that this is caused by the Adam optimizer's beta_1 parameter and that setting it to 0.99 should fix it, but the problem persists.
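For reference, the beta_1 change described above would look something like this (a minimal sketch; the tiny model is a placeholder, not the actual architecture from the question):

```python
from tensorflow import keras

# Placeholder model matching the (15000, 4410) -> (15000, 12) shapes in the logs.
model = keras.Sequential([
    keras.Input(shape=(4410,)),
    keras.layers.Dense(12, activation="sigmoid"),
])

# Raise beta_1 from its default of 0.9 to 0.99, as suggested.
opt = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.99)
model.compile(optimizer=opt, loss="binary_crossentropy",
              metrics=["binary_accuracy", "accuracy"])
```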


Is there anything else that could cause this? It looks like a computation-side issue, since there is no sign of a hardware/driver problem.

If anyone else runs into this problem, here is a list of things that may help:

  • Try setting beta_1 to 0.99 in the Adam optimizer.
  • If you call model.fit() multiple times, adding this after fit() may also help:
    K.clear_session()
    (make sure you have imported the backend first: from keras import backend as K)
  • Put the ConfigProto/allow_growth snippet at the bottom of this answer right after your imports (if using tensorflow > 2.0).
  • If you have a file open (after using file.open()), make sure you close it (or, better, use a with block).
  • Make sure nothing else in the background can use the GPU (games, heavy websites, etc.).
  • Check your page-file usage. Since the page file is much slower than RAM, you may be running low on memory; executing del variable on data you no longer need may help. Worst case, you will have to load the data in smaller chunks or reduce the model size.
  • Try setting the GPU to maximum performance in the NVIDIA Control Panel.

    If anyone has any other ideas on how to fix a problem like this, feel free to comment and I will edit this answer.
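The clear_session tip is the one most relevant to the "Run X of 40" pattern in the logs: calling model.fit() in a loop without clearing leaves stale graph state around, and step time can creep up. A minimal sketch of such a loop (the shapes mirror the logs, but the model itself is a placeholder):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import backend as K

def build_model():
    # Placeholder architecture; swap in your real model here.
    model = keras.Sequential([
        keras.Input(shape=(4410,)),
        keras.layers.Dense(12, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# Small synthetic stand-ins for the real training data.
x = np.random.rand(64, 4410).astype("float32")
y = np.random.randint(0, 2, size=(64, 12)).astype("float32")

losses = []
for run in range(3):  # 40 runs in the logs above
    model = build_model()
    hist = model.fit(x, y, epochs=1, batch_size=32, verbose=0)
    losses.append(hist.history["loss"][0])
    # Drop the finished run's graph/state so it does not accumulate.
    K.clear_session()
```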

    There could be many reasons. I don't see why this question was downvoted, but it would help to show some code, such as the model definition and the fit call/generator. There is barely enough information to give an answer: no code, just some plots that nobody knows how to interpret, so as it stands the question is of little use to others. Again, this answer would be more helpful if you added the model definition and the fit call to the question.
    I highly doubt it is my model definition and/or my fit call, because following the steps in my answer fixed my problem, which is why I posted it.
    import tensorflow as tf

    config = tf.compat.v1.ConfigProto()
    # Allocate GPU memory on demand instead of reserving it all up front.
    config.gpu_options.allow_growth = True
    sess = tf.compat.v1.Session(config=config)
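For completeness, on TensorFlow 2.x the same allow-growth behaviour can also be requested through the native config API instead of the compat ConfigProto above (it must run before any op touches the GPU, and it is a no-op on CPU-only machines):

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    # Request on-demand GPU memory allocation, the TF2 equivalent of allow_growth.
    tf.config.experimental.set_memory_growth(gpu, True)
```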