TensorFlow: How do I encode a string as a fixed-length tensor for use on a TPU?

tensorflow google-compute-engine

My labels include a tf.string tensor holding a filename. This works fine when I train on a GPU, but when I train on a TPU I get the following error:

File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/tpu/tpu_feed.py", line 494, in generate_dequeue_op
    dtypes=self._tuple_types, shapes=sharded_shapes, name=full_name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/tpu/ops/tpu_ops.py", line 241, in infeed_dequeue_tuple
    "{}".format(dtype, list(_SUPPORTED_INFEED_DTYPES)))
TypeError: <dtype: 'string'> is not a supported TPU infeed type. Supported types are: [tf.float32, tf.int32, tf.complex64, tf.int64, tf.bool, tf.bfloat16, tf.uint32]

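However, this does not seem to give me a tensor with a static shape, which is what the TPU requires.

Overall questions:

1. Suppose I have a tf.string tensor. How do I pad the string to a fixed length (say 5000 characters) and then convert it into a tensor of UTF-8 code points? Preferably without using tf.RaggedTensor.

2. Is there a more general workaround for using tf.strings with a TPU when they are not used as feature inputs?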

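I tried to reproduce your error with the model from the GCP documentation.

In its dataset.py file, one specific function is responsible for converting the label: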

import tensorflow as tf

def decode_label(label):
    # Parse the raw bytes of the serialized label: tf.string -> [tf.uint8]
    label = tf.io.decode_raw(label, tf.uint8)
    label = tf.reshape(label, [])  # the label is a scalar
    return tf.cast(label, tf.int32)  # int32 is a supported infeed dtype

If you remove the conversion and keep the label as a string:

def decode_label_keep_string(label):
    # No conversion: the label stays a tf.string scalar
    label = tf.reshape(label, [])
    return label

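you get the same error: TypeError: <dtype: 'string'> is not a supported TPU infeed type. Supported types are: [tf.float32, tf.int32, tf.complex64, tf.int64, tf.bool, tf.bfloat16, tf.uint32]

I therefore suggest fixing the error with the approach used in the decode_label function.

I hope this works for you.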
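To see which components of an input pipeline would trigger this TypeError before the TPU infeed is built, a check along the following lines may help. It is only a sketch: the helper name unsupported_dtypes is hypothetical, the set of dtypes is copied from the error message rather than read from TensorFlow, and it assumes a TensorFlow version in which tf.data.Dataset exposes element_spec.

```python
import tensorflow as tf

# Supported infeed dtypes, copied by hand from the TypeError message
# (an assumption: the list may differ between TensorFlow versions).
SUPPORTED_INFEED_DTYPES = {
    tf.float32, tf.int32, tf.complex64, tf.int64,
    tf.bool, tf.bfloat16, tf.uint32,
}

def unsupported_dtypes(dataset):
    """Hypothetical helper: list the dtypes in a dataset that the TPU infeed rejects."""
    # element_spec mirrors the nested structure of one dataset element.
    specs = tf.nest.flatten(dataset.element_spec)
    return [spec.dtype for spec in specs
            if spec.dtype not in SUPPORTED_INFEED_DTYPES]

# Example: a float feature paired with a string filename.
example = tf.data.Dataset.from_tensor_slices(
    (tf.zeros([4, 2]), tf.constant(["a.png", "b.png", "c.png", "d.png"])))
bad = unsupported_dtypes(example)  # contains tf.string for the filename component
```

Any dtype the helper reports has to be converted (or dropped) in the input pipeline before the data reaches the TPU.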

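Ah, sorry, this doesn't help. I can convert string labels to integer labels just fine. The string here is the filename/identifier of the training example, so I need to pass it through. I tried using Python's built-in UTF-8 support to pass UTF code points through the TPU. That eventually worked, but it is very hacky and introduces excessive memory requirements. I was hoping TensorFlow had a canonical solution, since so many people use it for NLP.

I checked the method you tried, tf.strings.unicode_encode(tf.RaggedTensor.from_tensor(batch_chars_padded, padding=-1), output_encoding='UTF-8'). It is meant for converting from ints to strings (tf.strings.unicode_encode: converts a vector of code points into an encoded string scalar). Have you tried tf.strings.unicode_decode(text_utf8, input_encoding='UTF-8') instead? (tf.strings.unicode_decode: converts an encoded string scalar into a vector of code points.)

I tried tf.RaggedTensor, but it did not work for me; somehow the strings still end up in the TPU graph. Is there any relevant open-source code for this? I have no idea how NLP projects handle it.

Yes, tf.RaggedTensor is not expected to work. Have you tried tf.strings.unicode_decode(text_utf8, input_encoding='UTF-8')?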
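The tf.strings.unicode_decode suggestion from the comments can be sketched as a round trip: decode the strings to int32 code points, pad or truncate them to a fixed length, and turn them back into strings once they have come out of the TPU. This is a minimal sketch, not a confirmed canonical solution. The names encode_fixed, decode_fixed, MAX_LEN and PAD are hypothetical, the padding value 0 assumes that code point 0 never occurs in the filenames, and it assumes a TensorFlow version in which RaggedTensor.to_tensor accepts a shape argument. String ops are expected to run on the host (for example in the tf.data pipeline), not in the TPU graph.

```python
import tensorflow as tf

MAX_LEN = 5000  # fixed length taken from the question
PAD = 0         # assumed padding value: code point 0 (NUL) should not appear in filenames

def encode_fixed(strings, max_len=MAX_LEN):
    """tf.string [batch] -> tf.int32 [batch, max_len] of Unicode code points."""
    # For a 1-D string tensor, unicode_decode returns a RaggedTensor of int32.
    codepoints = tf.strings.unicode_decode(strings, input_encoding="UTF-8")
    # to_tensor pads short rows with PAD and truncates rows longer than max_len.
    return codepoints.to_tensor(default_value=PAD, shape=[None, max_len])

def decode_fixed(codepoints):
    """tf.int32 [batch, max_len] -> tf.string [batch], with trailing padding removed."""
    # from_tensor drops each row's trailing run of PAD values.
    ragged = tf.RaggedTensor.from_tensor(codepoints, padding=PAD)
    return tf.strings.unicode_encode(ragged, output_encoding="UTF-8")

# Round trip on the host, using a small max_len for illustration.
names = tf.constant(["train/img_001.png", "héllo"])
encoded = encode_fixed(names, max_len=32)  # int32, a supported infeed dtype
decoded = decode_fixed(encoded)
```

In an input pipeline, encode_fixed would typically be applied with dataset.map after batching with drop_remainder=True, so that both dimensions of the encoded tensor are static. Strings longer than max_len are silently truncated, so the length has to be chosen to cover the longest identifier.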