Python 3.x: TensorFlow Estimator fails with an unknown Input/output error on Azure Databricks

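I am trying to run the official BERT pretraining script by following this tutorial; the main difference is that I am running it on Azure Databricks. When I run the TensorFlow estimator to train the network, it starts normally and saves the first iteration of the model. However, when it tries to save the second one, I get an Input/output error that appears to be caused by an attempt to rename a temporary file. Does anyone know a fix for this?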

---------------------------------------------------------------------------
UnknownError                              Traceback (most recent call last)
<command-3548884146162520> in <module>()
----> 1 estimator.train(input_fn=train_input_fn, max_steps=TRAIN_STEPS)

/databricks/python/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
   2874     finally:
   2875       rendezvous.record_done('training_loop')
-> 2876       rendezvous.raise_errors()
   2877 
   2878   def evaluate(self,

/databricks/python/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py in raise_errors(self, timeout_sec)
    129       else:
    130         logging.warn('Reraising captured error')
--> 131         six.reraise(typ, value, traceback)
    132 
    133     for k, (typ, value, traceback) in kept_errors:

/databricks/python/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

/databricks/python/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
   2869           steps=steps,
   2870           max_steps=max_steps,
-> 2871           saving_listeners=saving_listeners)
   2872     except Exception:  # pylint: disable=broad-except
   2873       rendezvous.record_error('training_loop', sys.exc_info())

/databricks/python/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
    365 
    366       saving_listeners = _check_listeners_type(saving_listeners)
--> 367       loss = self._train_model(input_fn, hooks, saving_listeners)
    368       logging.info('Loss for final step: %s.', loss)
    369       return self

/databricks/python/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py in _train_model(self, input_fn, hooks, saving_listeners)
   1156       return self._train_model_distributed(input_fn, hooks, saving_listeners)
   1157     else:
-> 1158       return self._train_model_default(input_fn, hooks, saving_listeners)
   1159 
   1160   def _train_model_default(self, input_fn, hooks, saving_listeners):

/databricks/python/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py in _train_model_default(self, input_fn, hooks, saving_listeners)
   1190       return self._train_with_estimator_spec(estimator_spec, worker_hooks,
   1191                                              hooks, global_step_tensor,
-> 1192                                              saving_listeners)
   1193 
   1194   def _train_model_distributed(self, input_fn, hooks, saving_listeners):

/databricks/python/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py in _train_with_estimator_spec(self, estimator_spec, worker_hooks, hooks, global_step_tensor, saving_listeners)
   1482       any_step_done = False
   1483       while not mon_sess.should_stop():
-> 1484         _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
   1485         any_step_done = True
   1486     if not any_step_done:

/databricks/python/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py in run(self, fetches, feed_dict, options, run_metadata)
    752         feed_dict=feed_dict,
    753         options=options,
--> 754         run_metadata=run_metadata)
    755 
    756   def run_step_fn(self, step_fn):

/databricks/python/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py in run(self, fetches, feed_dict, options, run_metadata)
   1250             feed_dict=feed_dict,
   1251             options=options,
-> 1252             run_metadata=run_metadata)
   1253       except _PREEMPTION_ERRORS as e:
   1254         logging.info(

/databricks/python/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py in run(self, *args, **kwargs)
   1351         raise six.reraise(*original_exc_info)
   1352       else:
-> 1353         raise six.reraise(*original_exc_info)
   1354 
   1355 

/databricks/python/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

/databricks/python/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py in run(self, *args, **kwargs)
   1336   def run(self, *args, **kwargs):
   1337     try:
-> 1338       return self._sess.run(*args, **kwargs)
   1339     except _PREEMPTION_ERRORS:
   1340       raise

/databricks/python/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py in run(self, fetches, feed_dict, options, run_metadata)
   1417               results=outputs[hook] if hook in outputs else None,
   1418               options=options,
-> 1419               run_metadata=run_metadata))
   1420     self._should_stop = self._should_stop or run_context.stop_requested
   1421 

/databricks/python/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py in after_run(self, run_context, run_values)
    592       if self._timer.should_trigger_for_step(global_step):
    593         self._timer.update_last_triggered_step(global_step)
--> 594         if self._save(run_context.session, global_step):
    595           run_context.request_stop()
    596 

/databricks/python/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py in _save(self, session, step)
    609       l.before_save(session, step)
    610 
--> 611     self._get_saver().save(session, self._save_path, global_step=step)
    612     self._summary_writer.add_session_log(
    613         SessionLog(

/databricks/python/lib/python3.6/site-packages/tensorflow/python/training/saver.py in save(self, sess, save_path, global_step, latest_filename, meta_graph_suffix, write_meta_graph, write_state, strip_default_attrs, save_debug_info)
   1181               all_model_checkpoint_paths=self.last_checkpoints,
   1182               latest_filename=latest_filename,
-> 1183               save_relative_paths=self._save_relative_paths)
   1184           self._MaybeDeleteOldCheckpoints(meta_graph_suffix=meta_graph_suffix)
   1185       except (errors.FailedPreconditionError, errors.NotFoundError) as exc:

/databricks/python/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_management.py in update_checkpoint_state_internal(save_dir, model_checkpoint_path, all_model_checkpoint_paths, latest_filename, save_relative_paths, all_model_checkpoint_timestamps, last_preserved_timestamp)
    240   # file.
    241   file_io.atomic_write_string_to_file(coord_checkpoint_filename,
--> 242                                       text_format.MessageToString(ckpt))
    243 
    244 

/databricks/python/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py in atomic_write_string_to_file(filename, contents, overwrite)
    538   write_string_to_file(temp_pathname, contents)
    539   try:
--> 540     rename(temp_pathname, filename, overwrite)
    541   except errors.OpError:
    542     delete_file(temp_pathname)

/databricks/python/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py in rename(oldname, newname, overwrite)
    500     errors.OpError: If the operation fails.
    501   """
--> 502   rename_v2(oldname, newname, overwrite)
    503 
    504 

/databricks/python/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py in rename_v2(src, dst, overwrite)
    517   """
    518   pywrap_tensorflow.RenameFile(
--> 519       compat.as_bytes(src), compat.as_bytes(dst), overwrite)
    520 
    521 

UnknownError: /dbfs/tmp/model/checkpoint.tmp2feb8d7a932249e7ba1a11f96d3cb334; Input/output error
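
The last frames of the traceback show the failure inside atomic_write_string_to_file, which writes a temporary file next to the checkpoint state file and then renames it over the existing one. The first save has nothing to overwrite, which would be consistent with only the second save failing. The sketch below is not TensorFlow code: it mimics that write-then-rename pattern with the standard library and illustrates one possible workaround, under the unconfirmed assumption that the /dbfs mount is what rejects the rename. The idea is to let the estimator checkpoint into a directory on the driver's local disk and copy the files to /dbfs afterwards with plain file copies. The helper names, and any local path such as /tmp/model, are hypothetical.

```python
import os
import shutil
import tempfile
import uuid


def atomic_write_string_to_file(filename, contents):
    """Mimic the pattern from the traceback: write a temp file, then rename it."""
    temp_pathname = filename + ".tmp" + uuid.uuid4().hex
    with open(temp_pathname, "w") as f:
        f.write(contents)
    try:
        # Rename over the target; this is the step that raised the I/O error.
        os.replace(temp_pathname, filename)
    except OSError:
        os.remove(temp_pathname)
        raise


def copy_checkpoints(local_dir, remote_dir):
    """Copy every regular file from local_dir into remote_dir.

    Uses plain copies only, so no rename is issued against the destination
    (hypothetical helper, not part of TensorFlow or Databricks).
    """
    os.makedirs(remote_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(local_dir)):
        src = os.path.join(local_dir, name)
        if os.path.isfile(src):
            shutil.copyfile(src, os.path.join(remote_dir, name))
            copied.append(name)
    return copied


if __name__ == "__main__":
    # Stand-ins for a local checkpoint directory and a /dbfs destination.
    base = tempfile.mkdtemp()
    local_dir = os.path.join(base, "local_model")
    remote_dir = os.path.join(base, "remote_model")
    os.makedirs(local_dir)

    state_file = os.path.join(local_dir, "checkpoint")
    atomic_write_string_to_file(state_file, "step 1")  # first save
    atomic_write_string_to_file(state_file, "step 2")  # second save overwrites
    print(copy_checkpoints(local_dir, remote_dir))
```

In the original setup this would mean giving the estimator a local checkpoint directory instead of /dbfs/tmp/model and calling something like copy_checkpoints once training finishes or between training runs. Whether that resolves the error on a given Databricks runtime is untested here.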