TensorFlow Estimator fails with an unknown Input/output error on Azure Databricks (Python 3.x)


python-3.x azure tensorflow


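I am trying to run the official BERT pretraining script by following this tutorial, with the main difference being that I am running it on Azure Databricks. When I run the TensorFlow estimator to train the network, it starts normally and saves the first checkpoint of the model. However, when it tries to save the second checkpoint, I get an Input/output error that appears to be caused by an attempt to rename a temporary file. Does anyone know how to fix this?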

---------------------------------------------------------------------------
UnknownError                              Traceback (most recent call last)
<command-3548884146162520> in <module>()
----> 1 estimator.train(input_fn=train_input_fn, max_steps=TRAIN_STEPS)

/databricks/python/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
   2874     finally:
   2875       rendezvous.record_done('training_loop')
-> 2876       rendezvous.raise_errors()
   2877 
   2878   def evaluate(self,

/databricks/python/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py in raise_errors(self, timeout_sec)
    129       else:
    130         logging.warn('Reraising captured error')
--> 131         six.reraise(typ, value, traceback)
    132 
    133     for k, (typ, value, traceback) in kept_errors:

/databricks/python/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

/databricks/python/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
   2869           steps=steps,
   2870           max_steps=max_steps,
-> 2871           saving_listeners=saving_listeners)
   2872     except Exception:  # pylint: disable=broad-except
   2873       rendezvous.record_error('training_loop', sys.exc_info())

/databricks/python/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
    365 
    366       saving_listeners = _check_listeners_type(saving_listeners)
--> 367       loss = self._train_model(input_fn, hooks, saving_listeners)
    368       logging.info('Loss for final step: %s.', loss)
    369       return self

/databricks/python/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py in _train_model(self, input_fn, hooks, saving_listeners)
   1156       return self._train_model_distributed(input_fn, hooks, saving_listeners)
   1157     else:
-> 1158       return self._train_model_default(input_fn, hooks, saving_listeners)
   1159 
   1160   def _train_model_default(self, input_fn, hooks, saving_listeners):

/databricks/python/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py in _train_model_default(self, input_fn, hooks, saving_listeners)
   1190       return self._train_with_estimator_spec(estimator_spec, worker_hooks,
   1191                                              hooks, global_step_tensor,
-> 1192                                              saving_listeners)
   1193 
   1194   def _train_model_distributed(self, input_fn, hooks, saving_listeners):

/databricks/python/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py in _train_with_estimator_spec(self, estimator_spec, worker_hooks, hooks, global_step_tensor, saving_listeners)
   1482       any_step_done = False
   1483       while not mon_sess.should_stop():
-> 1484         _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
   1485         any_step_done = True
   1486     if not any_step_done:

/databricks/python/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py in run(self, fetches, feed_dict, options, run_metadata)
    752         feed_dict=feed_dict,
    753         options=options,
--> 754         run_metadata=run_metadata)
    755 
    756   def run_step_fn(self, step_fn):

/databricks/python/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py in run(self, fetches, feed_dict, options, run_metadata)
   1250             feed_dict=feed_dict,
   1251             options=options,
-> 1252             run_metadata=run_metadata)
   1253       except _PREEMPTION_ERRORS as e:
   1254         logging.info(

/databricks/python/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py in run(self, *args, **kwargs)
   1351         raise six.reraise(*original_exc_info)
   1352       else:
-> 1353         raise six.reraise(*original_exc_info)
   1354 
   1355 

/databricks/python/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

/databricks/python/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py in run(self, *args, **kwargs)
   1336   def run(self, *args, **kwargs):
   1337     try:
-> 1338       return self._sess.run(*args, **kwargs)
   1339     except _PREEMPTION_ERRORS:
   1340       raise

/databricks/python/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py in run(self, fetches, feed_dict, options, run_metadata)
   1417               results=outputs[hook] if hook in outputs else None,
   1418               options=options,
-> 1419               run_metadata=run_metadata))
   1420     self._should_stop = self._should_stop or run_context.stop_requested
   1421 

/databricks/python/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py in after_run(self, run_context, run_values)
    592       if self._timer.should_trigger_for_step(global_step):
    593         self._timer.update_last_triggered_step(global_step)
--> 594         if self._save(run_context.session, global_step):
    595           run_context.request_stop()
    596 

/databricks/python/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py in _save(self, session, step)
    609       l.before_save(session, step)
    610 
--> 611     self._get_saver().save(session, self._save_path, global_step=step)
    612     self._summary_writer.add_session_log(
    613         SessionLog(

/databricks/python/lib/python3.6/site-packages/tensorflow/python/training/saver.py in save(self, sess, save_path, global_step, latest_filename, meta_graph_suffix, write_meta_graph, write_state, strip_default_attrs, save_debug_info)
   1181               all_model_checkpoint_paths=self.last_checkpoints,
   1182               latest_filename=latest_filename,
-> 1183               save_relative_paths=self._save_relative_paths)
   1184           self._MaybeDeleteOldCheckpoints(meta_graph_suffix=meta_graph_suffix)
   1185       except (errors.FailedPreconditionError, errors.NotFoundError) as exc:

/databricks/python/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_management.py in update_checkpoint_state_internal(save_dir, model_checkpoint_path, all_model_checkpoint_paths, latest_filename, save_relative_paths, all_model_checkpoint_timestamps, last_preserved_timestamp)
    240   # file.
    241   file_io.atomic_write_string_to_file(coord_checkpoint_filename,
--> 242                                       text_format.MessageToString(ckpt))
    243 
    244 

/databricks/python/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py in atomic_write_string_to_file(filename, contents, overwrite)
    538   write_string_to_file(temp_pathname, contents)
    539   try:
--> 540     rename(temp_pathname, filename, overwrite)
    541   except errors.OpError:
    542     delete_file(temp_pathname)

/databricks/python/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py in rename(oldname, newname, overwrite)
    500     errors.OpError: If the operation fails.
    501   """
--> 502   rename_v2(oldname, newname, overwrite)
    503 
    504 

/databricks/python/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py in rename_v2(src, dst, overwrite)
    517   """
    518   pywrap_tensorflow.RenameFile(
--> 519       compat.as_bytes(src), compat.as_bytes(dst), overwrite)
    520 
    521 

UnknownError: /dbfs/tmp/model/checkpoint.tmp2feb8d7a932249e7ba1a11f96d3cb334; Input/output error
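
The last frames of the traceback show `atomic_write_string_to_file` writing to a temporary file and then renaming it over the target, and the first save succeeded while the second did not. A minimal sketch for checking whether a directory tolerates that write-then-rename pattern, including the overwrite on the second pass, might look like the following. It uses only the Python standard library rather than TensorFlow's own file layer, so it approximates the pattern and does not reproduce TensorFlow's code path; the function name `check_atomic_rename` and the probe filename `rename_probe.txt` are hypothetical.

```python
import os
import tempfile
import uuid


def check_atomic_rename(directory, filename="rename_probe.txt"):
    """Write `filename` twice via a temp file plus rename.

    Returns True if both renames succeed, False if either one fails.
    The first pass creates the target; the second pass overwrites it,
    which mirrors the point at which the traceback above fails.
    """
    target = os.path.join(directory, filename)
    for attempt in range(2):
        # Hypothetical temp-name scheme, loosely modelled on the
        # "checkpoint.tmp<hex>" name visible in the error message.
        temp_path = target + ".tmp" + uuid.uuid4().hex
        with open(temp_path, "w") as f:
            f.write("attempt %d\n" % attempt)
        try:
            os.replace(temp_path, target)  # rename, overwriting if present
        except OSError as exc:
            print("rename failed on attempt %d: %s" % (attempt, exc))
            if os.path.exists(temp_path):
                os.remove(temp_path)
            return False
    return True


if __name__ == "__main__":
    # Local temporary directory used purely as a demonstration target.
    with tempfile.TemporaryDirectory() as local_dir:
        print(check_atomic_rename(local_dir))
```

Running the same function against `/dbfs/tmp/model` and against a directory on the driver's local disk would indicate whether the failure is specific to the DBFS mount. The probe leaves a small `rename_probe.txt` file behind in the directory it tests.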
