How to return a dictionary of tensors from tf.py_function? (Python 3.x, TensorFlow 2.0, Hugging Face Transformers)


Typically, a transformers tokenizer encodes its input as a dictionary:

{"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}

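For better performance on large datasets, it is best to implement a pipeline that uses Dataset.map to apply the tokenizer function to each element of the input dataset, exactly as is done in the TensorFlow tutorial.

However, tf.py_function (used to wrap the Python function passed to map) does not support returning a dictionary of tensors like the one shown above.

For example, if the tokenizer (encoder) returns the following dictionary: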

{
"input_ids": [101, 13366, 2131, 1035, 6819, 2094, 1035, 102],
"attention_mask": [1, 1, 1, 1, 1, 1, 1, 1]
}

How should the Tout parameter be set to obtain the desired dictionary of tensors, with the keys "input_ids" and "attention_mask"?

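tf.py_function does not allow a Python dict as a return type.

As a workaround in this case, you can do the data conversion inside the py_function and then call a second Dataset.map, without py_function, that returns the dictionary:
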
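import tensorflow as tf

def gen():
  # Dummy generator that yields a single placeholder element.
  yield 1

def process_data(x):
  # Runs eagerly inside tf.py_function; returns a tuple, not a dict.
  # The hard-coded values stand in for real tokenizer output.
  return ([ 101, 13366,  2131,  1035,  6819,  2094,  1035,  102 ],
          [ 1, 1, 1, 1, 1, 1, 1, 1 ])

def create_dict(input_ids, attention_mask):
  # Plain map function (no py_function), so it may return a dict.
  return {"input_ids": tf.convert_to_tensor(input_ids),
          "attention_mask": tf.convert_to_tensor(attention_mask)}

ds = (tf.data.Dataset
      .from_generator(gen, (tf.int32))
      # First map: Python-side processing, returning a tuple of tensors.
      .map(lambda x: tf.py_function(process_data, inp=[x],
                                    Tout=(tf.int32, tf.int32)))
      # Second map: pack the tuple into a dictionary.
      .map(create_dict)
      .repeat())

# Print only the first element; the dataset repeats indefinitely.
for x in ds:
  print(x)
  break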

Output:
{'input_ids': <tf.Tensor: shape=(8,), dtype=int32, numpy=
array([  101, 13366,  2131,  1035,  6819,  2094,  1035,   102],
      dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(8,), dtype=int32, numpy=array([1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)>}
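
The same two-step idea can be generalized to a function that really does return a dict. The following is only a minimal sketch under stated assumptions: fake_tokenize is a hypothetical stand-in for a real tokenizer, and KEYS is an assumed fixed key order used to flatten the dict into a list inside tf.py_function and to rebuild it outside. It also calls set_shape, since tensors coming out of tf.py_function carry no static shape information.

```python
import tensorflow as tf

# Assumed fixed key order; the dict is flattened to a list in this order.
KEYS = ("input_ids", "attention_mask")


def fake_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: returns a plain Python dict."""
    n = len(text.numpy().decode("utf-8").split())
    return {"input_ids": list(range(101, 101 + n)), "attention_mask": [1] * n}


def encode_as_list(text):
    # Runs eagerly inside tf.py_function, so ordinary Python code is allowed.
    encoded = fake_tokenize(text)
    return [encoded[k] for k in KEYS]


def to_dict(text):
    # Step 1: call the Python code, getting back one tensor per key.
    values = tf.py_function(encode_as_list, inp=[text],
                            Tout=[tf.int32] * len(KEYS))
    # Step 2: rebuild the dict outside py_function.
    out = {}
    for key, value in zip(KEYS, values):
        value.set_shape([None])  # restore the rank lost by py_function
        out[key] = value
    return out


ds = tf.data.Dataset.from_tensor_slices(["a b c", "d e"]).map(to_dict)

for element in ds:
    print(element)
```

Because each value now has a known rank, a later batching step that pads variable-length sequences should be able to work with these elements, although that has not been verified against the original poster's setup.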

For now, I guess this is the best approach. Thank you @Mahendra Singh Meena