Keras: Unable to find the relevant tensor remote_handle: Op ID: 14738, Output num: 0
Tags: keras, google-colaboratory, tensorflow-datasets, tpu, data-pipeline

I am using a Colab Pro TPU instance for patch image classification, with TensorFlow version 2.3.0. When I call model.fit, I get the following error:
InvalidArgumentError: Unable to find the relevant tensor remote_handle: Op ID: 14738, Output num: 0
The traceback is as follows:
--------
InvalidArgumentError Traceback (most recent call last)
<ipython-input-20-5fd2ec1ce2f9> in <module>()
15 steps_per_epoch=STEPS_PER_EPOCH,
16 validation_data=dev_ds,
---> 17 validation_steps=VALIDATION_STEPS
18 )
6 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in _method_wrapper(self, *args, **kwargs)
106 def _method_wrapper(self, *args, **kwargs):
107 if not self._in_multi_worker_mode(): # pylint: disable=protected-access
--> 108 return method(self, *args, **kwargs)
109
110 # Running inside `run_distribute_coordinator` already.
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
1084 data_handler._initial_epoch = ( # pylint: disable=protected-access
1085 self._maybe_load_initial_epoch_from_ckpt(initial_epoch))
-> 1086 for epoch, iterator in data_handler.enumerate_epochs():
1087 self.reset_metrics()
1088 callbacks.on_epoch_begin(epoch)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/data_adapter.py in enumerate_epochs(self)
1140 if self._insufficient_data: # Set by `catch_stop_iteration`.
1141 break
-> 1142 if self._adapter.should_recreate_iterator():
1143 data_iterator = iter(self._dataset)
1144 yield epoch, data_iterator
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/data_adapter.py in should_recreate_iterator(self)
725 # each epoch.
726 return (self._user_steps is None or
--> 727 cardinality.cardinality(self._dataset).numpy() == self._user_steps)
728
729 def _validate_args(self, y, sample_weights, steps):
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py in numpy(self)
1061 """
1062 # TODO(slebedev): Consider avoiding a copy for non-CPU or remote tensors.
-> 1063 maybe_arr = self._numpy() # pylint: disable=protected-access
1064 return maybe_arr.copy() if isinstance(maybe_arr, np.ndarray) else maybe_arr
1065
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py in _numpy(self)
1029 return self._numpy_internal()
1030 except core._NotOkStatusException as e: # pylint: disable=protected-access
-> 1031 six.raise_from(core._status_to_exception(e.code, e.message), None) # pylint: disable=protected-access
1032
1033 @property
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)
InvalidArgumentError: Unable to find the relevant tensor remote_handle: Op ID: 14738, Output num: 0
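Reading the traceback, the exception surfaces while Keras evaluates `cardinality.cardinality(self._dataset).numpy()` inside `should_recreate_iterator` (data_adapter.py, lines 726-727), that is, when the dataset cardinality is pulled back to the host and compared with the user-supplied step count. The sketch below is a pure-Python imitation of that condition, not the Keras implementation: `should_recreate_iterator` is rewritten here as a standalone function, and `failing_cardinality` is a hypothetical stand-in for the remote call. It only illustrates that the lookup is reached when a step count is passed and skipped when it is `None`. Whether that avoids the error on a real TPU is not confirmed by this post.

```python
def should_recreate_iterator(user_steps, get_cardinality):
    """Imitates the condition shown in the traceback (illustrative only)."""
    # `or` short-circuits: get_cardinality() is called only when user_steps is set.
    return user_steps is None or get_cardinality() == user_steps


def failing_cardinality():
    # Hypothetical stand-in for cardinality(...).numpy() failing on a remote tensor.
    raise RuntimeError("Unable to find the relevant tensor remote_handle")


# No step count given: the cardinality lookup is never reached.
print(should_recreate_iterator(None, failing_cardinality))

# Step count given: the lookup runs and its failure propagates to the caller.
try:
    should_recreate_iterator(100, failing_cardinality)
except RuntimeError as err:
    print("raised:", err)
```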
Below is the code that creates and compiles the model and fits the dataset. I am using a custom Keras model with a VGG16 backbone:
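import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense, Dropout, Flatten

def create_model(input_shape, batch_size):
    # Pretrained VGG16 convolutional base without the classification head.
    VGG16 = keras.applications.VGG16(include_top=False, input_shape=input_shape, weights='imagenet')

    # Freeze the pretrained layers so only the new head is trained.
    for layer in VGG16.layers:
        layer.trainable = False

    input_layer = keras.Input(shape=input_shape, batch_size=batch_size)
    VGG_out = VGG16(input_layer)

    # Binary classification head on top of the VGG16 features.
    x = Flatten(name='flatten', input_shape=(512,8,8))(VGG_out)
    x = Dense(256, activation='relu', name='fc1')(x)
    x = Dropout(0.5)(x)
    x = Dense(1, activation='sigmoid', name='fc2')(x)

    model = Model(input_layer, x)
    model.summary()
    return model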
with strategy.scope():
    model = create_model(INPUT_SHAPE, BATCH_SIZE)
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  metrics=['accuracy'])

model.fit(train_ds,
          epochs=5,
          steps_per_epoch=STEPS_PER_EPOCH,
          validation_data=dev_ds,
          validation_steps=VALIDATION_STEPS
          )
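The `model.fit` call above relies on `STEPS_PER_EPOCH` and `VALIDATION_STEPS`, which the post does not define. A minimal sketch of one common way to derive them, assuming the number of examples in each split is known: `steps_for`, `NUM_TRAIN`, `NUM_DEV` and all the numbers are hypothetical and chosen only for illustration. These values matter here because the check in the traceback compares the dataset cardinality with the step count passed to `fit`.

```python
def steps_for(num_examples, batch_size):
    """Number of full batches in one pass, dropping any partial last batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return num_examples // batch_size


# Hypothetical dataset sizes and global batch size (not from the original post).
NUM_TRAIN = 10_000
NUM_DEV = 1_000
BATCH_SIZE = 128

STEPS_PER_EPOCH = steps_for(NUM_TRAIN, BATCH_SIZE)
VALIDATION_STEPS = steps_for(NUM_DEV, BATCH_SIZE)
print(STEPS_PER_EPOCH, VALIDATION_STEPS)
```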
For the TPU initialization and the strategy, I use strategy = tf.distribute.TPUStrategy(resolver).
The initialization code is as follows:
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))
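The initialization above reads `COLAB_TPU_ADDR` directly and raises a `KeyError` when no TPU runtime is attached. As a sketch under stated assumptions, the steps can be wrapped so the same notebook also runs without a TPU. `colab_tpu_address` and `make_strategy` are hypothetical helper names, not part of TensorFlow; the TensorFlow calls themselves are the ones used in the post, plus `tf.distribute.get_strategy()` for the fallback.

```python
import os


def colab_tpu_address(environ=os.environ):
    """Return 'grpc://<addr>' if COLAB_TPU_ADDR is set, otherwise None."""
    addr = environ.get("COLAB_TPU_ADDR")
    return "grpc://" + addr if addr else None


def make_strategy():
    """Build a TPUStrategy when a Colab TPU is attached, else the default strategy."""
    import tensorflow as tf  # imported lazily so the helper above works without TF

    address = colab_tpu_address()
    if address is None:
        # No TPU runtime: fall back to the default (single-device) strategy.
        return tf.distribute.get_strategy()

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=address)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    return tf.distribute.TPUStrategy(resolver)
```

Calling `strategy = make_strategy()` once, before building the model inside `strategy.scope()`, would replace the four lines above.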
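A copy of the whole notebook, with outputs, can be found at:

I ran into the same problem when training on a TPU, but only when num_epochs > 1. Were you able to find a workaround?

@Pooya448 Has the error above been resolved?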