Python TensorFlow TPU training: NotFoundError 'ParallelInterleaveDataset'


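I am trying to train a neural network on a TPU on Google Cloud Platform (GCP).

I have saved my files as local TFRecords and opened a Jupyter notebook running on a VM (Compute Engine), where I am writing the training code.

My code runs fine up to the point where training starts. Then I get this error message: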

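NotFoundError: Op type not registered 'ParallelInterleaveDataset' in binary running on n-b2696fa0-w-0. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.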

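I googled the error and found this issue. It states that certain operations are not allowed in code that runs on the TPU.

However, I never used a function called 'ParallelInterleaveDataset'. My questions are: what could be causing this problem, and what can I do to fix it and train my network on the TPU?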
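One way to narrow this down might be to check whether the input pipeline adds the op indirectly, since some tf.data helpers create dataset ops that the user never names in their own code. The following is only a diagnostic sketch, not a confirmed fix: `find_ops_by_type` and `FakeOp` are hypothetical names introduced here, and the stand-in list replaces a real graph so that the snippet runs without TensorFlow.

```python
from collections import namedtuple


def find_ops_by_type(operations, op_type):
    """Return the names of all operations whose type contains `op_type`.

    `operations` can be any iterable of objects that expose `.name` and
    `.type` attributes (hypothetical helper, not part of TensorFlow).
    """
    return [op.name for op in operations if op_type in op.type]


# Stand-in for graph operations so the sketch runs without TensorFlow.
# The names below are made up for illustration only.
FakeOp = namedtuple("FakeOp", ["name", "type"])
ops = [
    FakeOp("TFRecordDataset", "TFRecordDataset"),
    FakeOp("input_pipeline/interleave", "ParallelInterleaveDataset"),
    FakeOp("dense/MatMul", "MatMul"),
]

matches = find_ops_by_type(ops, "ParallelInterleaveDataset")
print(matches)
```

In a TensorFlow 1.x session, the same helper could be applied to `tf.get_default_graph().get_operations()` after building the input pipeline inside a fresh `tf.Graph`, since each returned operation has `name` and `type` attributes. If a match shows up, the op name usually hints at which part of the input function created it.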

--

For completeness, the full error message:

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:TPU job name tpu_worker
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Error recorded from training_loop: Op type not registered 'ParallelInterleaveDataset' in binary running on n-b2696fa0-w-0. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
---------------------------------------------------------------------------
NotFoundError                             Traceback (most recent call last)
~/yes/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1333     try:
-> 1334       return fn(*args)
   1335     except errors.OpError as e:

~/yes/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1316       # Ensure any changes to the graph are reflected in the runtime.
-> 1317       self._extend_graph()
   1318       return self._call_tf_sessionrun(

~/yes/lib/python3.6/site-packages/tensorflow/python/client/session.py in _extend_graph(self)
   1351     with self._graph._session_run_lock():  # pylint: disable=protected-access
-> 1352       tf_session.ExtendSession(self._session)
   1353 

NotFoundError: Op type not registered 'ParallelInterleaveDataset' in binary running on n-b2696fa0-w-0. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.

During handling of the above exception, another exception occurred:

NotFoundError                             Traceback (most recent call last)
<ipython-input-115-ee69fe04790e> in <module>
----> 1 tpu_estimator.train(input_fn=train_input_fn, steps=1)

~/yes/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
   2407       if ctx.is_running_on_cpu(is_export_mode=False):
   2408         with ops.device('/device:CPU:0'):
-> 2409           return input_fn(**kwargs)
   2410 
   2411       # For TPU computation, input_fn should be invoked in a tf.while_loop for

~/yes/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py in raise_errors(self, timeout_sec)
    126       else:
    127         logging.warn('Reraising captured error')
--> 128         six.reraise(typ, value, traceback)
    129 
    130     for k, (typ, value, traceback) in kept_errors:

~/yes/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

~/yes/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
   2401       if batch_size_for_input_fn is not None:
   2402         _add_item_to_params(kwargs['params'], _BATCH_SIZE_KEY,
-> 2403                             batch_size_for_input_fn)
   2404 
   2405       # For export_savedmodel, input_fn is never passed to Estimator. So,

~/yes/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)

~/yes/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py in _train_model(self, input_fn, hooks, saving_listeners)

~/yes/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py in _train_model_default(self, input_fn, hooks, saving_listeners)

~/yes/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py in _train_with_estimator_spec(self, estimator_spec, worker_hooks, hooks, global_step_tensor, saving_listeners)

~/yes/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py in MonitoredTrainingSession(master, is_chief, checkpoint_dir, scaffold, hooks, chief_only_hooks, save_checkpoint_secs, save_summaries_steps, save_summaries_secs, config, stop_grace_period_secs, log_step_count_steps, max_wait_secs, save_checkpoint_steps, summary_dir)
    502 
    503   if hooks:
--> 504     all_hooks.extend(hooks)
    505   return MonitoredSession(
    506       session_creator=session_creator,

~/yes/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py in __init__(self, session_creator, hooks, stop_grace_period_secs)
    919   * it cannot be sent to tf.train.start_queue_runners.
    920 
--> 921   Args:
    922     session_creator: A factory object to create session. Typically a
    923       `ChiefSessionCreator` which is the default one.

~/yes/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py in __init__(self, session_creator, hooks, should_recover, stop_grace_period_secs)
    641 
    642     # Create the session.
--> 643     self._coordinated_creator = self._CoordinatedSessionCreator(
    644         session_creator=session_creator or ChiefSessionCreator(),
    645         hooks=self._hooks,

~/yes/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py in __init__(self, sess_creator)
   1105 
   1106   Calls to `run()` are delegated to the wrapped session.  If a call raises the
-> 1107   exception `tf.errors.AbortedError` or `tf.errors.UnavailableError`, the
   1108   wrapped session is closed, and a new one is created by calling the factory
   1109   again.

~/yes/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py in _create_session(self)
   1110   """
   1111 
-> 1112   def __init__(self, sess_creator):
   1113     """Create a new `_RecoverableSession`.
   1114 

~/yes/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py in create_session(self)
    798       self.coord = None
    799       self.tf_sess = None
--> 800       self._stop_grace_period_secs = stop_grace_period_secs
    801 
    802     def create_session(self):

~/yes/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py in create_session(self)
    564         self._master,
    565         saver=self._scaffold.saver,
--> 566         checkpoint_dir=self._checkpoint_dir,
    567         checkpoint_filename_with_path=self._checkpoint_filename_with_path,
    568         config=self._config,

~/yes/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py in prepare_session(self, master, init_op, saver, checkpoint_dir, checkpoint_filename_with_path, wait_for_checkpoint, max_wait_secs, config, init_feed_dict, init_fn)
    292     if not local_init_success:
    293       raise RuntimeError(
--> 294           "Init operations did not make model ready for local_init.  "
    295           "Init op: %s, init fn: %s, error: %s" % (_maybe_name(init_op),
    296                                                    init_fn,

~/yes/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    927     try:
    928       result = self._run(None, fetches, feed_dict, options_ptr,
--> 929                          run_metadata_ptr)
    930       if run_metadata:
    931         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~/yes/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1150     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1151       results = self._do_run(handle, final_targets, final_fetches,
-> 1152                              feed_dict_tensor, options, run_metadata)
   1153     else:
   1154       results = []

~/yes/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1326     if handle is None:
   1327       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1328                            run_metadata)
   1329     else:
   1330       return self._do_call(_prun_fn, handle, feeds, fetches)

~/yes/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1346           pass
   1347       message = error_interpolation.interpolate(message, self._graph)
-> 1348       raise type(e)(node_def, op, message)
   1349 
   1350   def _extend_graph(self):

NotFoundError: Op type not registered 'ParallelInterleaveDataset' in binary running on n-b2696fa0-w-0. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.