Saving a model for TensorFlow Serving when using distributed TensorFlow training

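We plan to implement distributed training in TensorFlow, and for this we used Distributed TensorFlow. We were able to run distributed training using asynchronous between-graph replication. Below is the code snippet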

with sv.prepare_or_wait_for_session(server.target) as sess:

We defined our training supervisor as follows

sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                         logdir=logdir,
                         init_op=init_op,
                         saver=saver,
                         summary_op=summary_op,
                         global_step=global_step)
We also initialized the supervisor with the snippet below

with sv.prepare_or_wait_for_session(server.target) as sess:
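For context, the snippets above rely on a `server` and a `FLAGS.task_index` that are not shown. A minimal sketch of how those inputs are commonly derived follows. The helper names `parse_cluster` and `is_chief` and the host addresses are hypothetical, not taken from the question. The returned dictionary has the job-name-to-address-list shape that `tf.train.ClusterSpec` accepts.

```python
def parse_cluster(ps_hosts, worker_hosts):
    """Turn comma-separated host lists into a {job name: [addresses]} dict."""
    def split(hosts):
        return [h.strip() for h in hosts.split(",") if h.strip()]

    return {"ps": split(ps_hosts), "worker": split(worker_hosts)}


def is_chief(task_index):
    """Mirror the question's rule: the worker with task index 0 is the chief."""
    return task_index == 0


# Hypothetical addresses, for illustration only.
cluster = parse_cluster(
    "ps0.example.com:2222",
    "worker0.example.com:2222, worker1.example.com:2222",
)
chief_flags = [is_chief(i) for i in range(len(cluster["worker"]))]
```

Only the chief initializes variables and writes checkpoints to `logdir`; the other workers wait for it inside `prepare_or_wait_for_session`.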
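We then fed the different batches through training. Everything works up to this point. However, when we try to save/export the model for TensorFlow Serving, it does not produce the correct set of checkpoint files, so we cannot serve it in production. When we host the checkpoint files through tensorflow_model_server, we get an error.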
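One thing worth ruling out when tensorflow_model_server rejects an export: the server loads SavedModel directories from numbered version subdirectories under the model base path, not raw checkpoint files. The sketch below is a rough sanity check, not part of TensorFlow. The function name `find_servable_versions` is hypothetical, and it assumes the usual SavedModel layout of a saved_model.pb (or saved_model.pbtxt) next to a variables/ directory containing variables.index.

```python
import os


def find_servable_versions(model_base_path):
    """Return the numeric version directories that look like complete SavedModels."""
    versions = []
    for name in os.listdir(model_base_path):
        version_dir = os.path.join(model_base_path, name)
        # TensorFlow Serving only considers subdirectories with numeric names.
        if not (name.isdigit() and os.path.isdir(version_dir)):
            continue
        has_graph = any(
            os.path.isfile(os.path.join(version_dir, fname))
            for fname in ("saved_model.pb", "saved_model.pbtxt")
        )
        has_index = os.path.isfile(
            os.path.join(version_dir, "variables", "variables.index")
        )
        if has_graph and has_index:
            versions.append(int(name))
    return sorted(versions)
```

An empty result for the directory passed as the model base path would suggest that the export step never produced a loadable SavedModel there.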


Note that we also tried the following ways of saving the trained graph

i) SavedModelBuilder

builder = saved_model_builder.SavedModelBuilder(export_path)
ii) Exporter

export_path = "/saved_graph/"
model_exporter.export(export_path, sess)
iii) tf.train.Saver

tf.train.Saver
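None of the approaches above succeeded.

We could not find any article that shows a complete example or explains this in detail, although we have gone through several reference links.

Any suggestion or reference would be a great help.

Thanks, everyone.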

---------------------------------------------------------------------

Following the suggestion, we tried using "clear_devices=True" when exporting the model, but it did not help. Below is the code snippet we used

    for epoch in range(training_epochs):

        epoch_num=0

        batch_count = int(num_img/batch_size)
        count = 0

        for i in range(batch_count):

            epoch_num=0

            # This will create batches out of our training dataset and it will
            # be passed to the feed_dict
            batch_x, batch_y = next_batch(batch_size, epoch_num, train_data, train_labels, num_img)

            # perform the operations we defined earlier on batch
            _, cost, step = sess.run([train_op, cross_entropy, global_step],
                                     feed_dict={X: batch_x, Y: batch_y})
            sess.run(tf.global_variables_initializer())
            builder = tf.saved_model.builder.SavedModelBuilder(path)
            builder.add_meta_graph_and_variables(
                sess,
                [tf.saved_model.tag_constants.SERVING],
                signature_def_map={
                    "magic_model":
                        tf.saved_model.signature_def_utils.predict_signature_def(
                            inputs={"image": X},
                            outputs={"prediction": preds})
                },
                clear_devices=True)
            builder.save()

            sv.stop()
            print("Done!!")
When we used clear_devices=True, we got the following error

Error: Traceback (most recent call last):
  File "insulator_classifier.py", line 370, in <module>
    tf.app.run()
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "insulator_classifier.py", line 283, in main
    }, clear_devices=False)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/saved_model/builder_impl.py", line 364, in add_meta_graph_and_variables
    allow_empty=True)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1140, in __init__
    self.build()
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1172, in build
    filename=self._filename)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 677, in build
    filename_tensor = constant_op.constant(filename or "model")
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 106, in constant
    attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2582, in create_op
    self._check_not_finalized()
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2290, in _check_not_finalized
    raise RuntimeError("Graph is finalized and cannot be modified.")
RuntimeError: Graph is finalized and cannot be modified.
We also see this warning:

    WARNING:tensorflow:From test_classifier.py:283: Exporter.export (from tensorflow.contrib.session_bundle.exporter) is deprecated and will be removed after 2017-06-30.
So ideally we should be using "tf.saved_model.builder.SavedModelBuilder", but for some reason it is not working.

Any other suggestions?

Thanks.

Pay attention to clear_devices=True:

  • If you use SavedModelBuilder, set clear_devices=True when calling add_meta_graph() or add_meta_graph_and_variables().

  • If you use exporter, set clear_devices=True when constructing exporter.Exporter.

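Update:

  • For the SavedModelBuilder issue, you do not need to create the SavedModelBuilder before each epoch, so move that line before the for loop. You also do not need to save the model after each epoch, so move builder.save() after the for loop. The code then looks like this: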

    builder = tf.saved_model.builder.SavedModelBuilder(path)
    
    builder.add_meta_graph_and_variables(sess,
                                         [tf.saved_model.tag_constants.SERVING],
                                         signature_def_map = {"magic_model": tf.saved_model.signature_def_utils.predict_signature_def(inputs= {"image": X}, outputs= {"prediction": preds})},
                                         clear_devices=True)
    
    for epoch in range(training_epochs):
    
        epoch_num=0
    
        batch_count = int(num_img/batch_size)
        count = 0
    
        for i in range(batch_count):
    
            epoch_num=0
    
            # This will create batches out of our training dataset and it will be passed to the feed_dict
            batch_x, batch_y = next_batch(batch_size, epoch_num, train_data, train_labels, num_img)
    
            # perform the operations we defined earlier on batch
            _, cost, step = sess.run([train_op, cross_entropy, global_step],
                                     feed_dict={X: batch_x, Y: batch_y})
            sess.run(tf.global_variables_initializer())
    
    builder.save()
    sv.stop()
    print("Done!!")
    
  • When using exporter.Exporter, the warning does not matter; you can still load the files with TensorFlow Serving.


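  • Thanks Tianjin Gu. We tried your suggestion, but got the error below: raise RuntimeError("Graph is finalized and cannot be modified.") RuntimeError: Graph is finalized and cannot be modified.
  • I have edited the question with more information about the test using "clear_devices=True".
  • We tested the export with suggestion 2, "exporter.Exporter", and it works. I have edited the question with more details. But is there any suggestion for making this work with "tf.saved_model.builder.SavedModelBuilder"? Thanks.
  • Thanks @Tianjin Gu. We will give it a try.
  • To clarify, what is the remaining error you get when using SavedModelBuilder?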
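The traceback in the question shows the failure inside the `tf.train.Saver` that `add_meta_graph_and_variables` constructs: creating that Saver adds ops to a graph that `tf.train.Supervisor` has already finalized. One possible workaround, sketched here under assumptions rather than offered as the confirmed fix for this thread, is to export as a separate step: rebuild the inference graph in a fresh `tf.Graph`, restore the latest checkpoint that the chief wrote to `logdir`, and run SavedModelBuilder there. `build_model` is a hypothetical stand-in for the question's model code, and the `tensorflow.compat.v1` import assumes TensorFlow 2.x (under TensorFlow 1.x, a plain `import tensorflow as tf` would be used instead).

```python
import os

import tensorflow.compat.v1 as tf


def build_model():
    """Hypothetical stand-in for the question's model; returns (X, preds)."""
    X = tf.placeholder(tf.float32, shape=[None, 4], name="X")
    W = tf.get_variable("W", shape=[4, 2], initializer=tf.zeros_initializer())
    b = tf.get_variable("b", shape=[2], initializer=tf.zeros_initializer())
    preds = tf.nn.softmax(tf.matmul(X, W) + b, name="preds")
    return X, preds


def export_from_checkpoint(checkpoint_dir, export_dir):
    """Restore the latest checkpoint into a fresh graph and export a SavedModel."""
    # A new graph is not finalized, so the builder may add its Saver ops to it.
    with tf.Graph().as_default():
        X, preds = build_model()
        saver = tf.train.Saver()
        with tf.Session() as sess:
            saver.restore(sess, tf.train.latest_checkpoint(checkpoint_dir))
            builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
            signature = tf.saved_model.signature_def_utils.predict_signature_def(
                inputs={"image": X}, outputs={"prediction": preds})
            builder.add_meta_graph_and_variables(
                sess,
                [tf.saved_model.tag_constants.SERVING],
                signature_def_map={"magic_model": signature},
                clear_devices=True)
            builder.save()
    return export_dir
```

Here `export_dir` must not exist yet and would normally end in a numeric version such as `.../1`, so that tensorflow_model_server can pick it up. Note that `add_meta_graph_and_variables` writes the variable values at the moment it is called, so the export should run only after training has produced the weights to be served.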