Python 3.x: Bert tokenizer fails with "0"; ValueError: Failed to convert a NumPy array to a Tensor


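I am trying to use the Bert tokenizer to create text encodings for an NER task. I am using tensorflow==2.2.0 and transformers==4.5.1. The texts have already been split into words, so the input is a list of sentences, each split into words: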

train_texts, val_texts, train_tags, val_tags = train_test_split(texts, tags, test_size=.2)
train_texts[0][134:150]

array(['TERMS', 'CONDITIONS', 'Delayed', 'payments', 'shall', 'be',
       'charged', 'interest', 'at', '24', 'p.a', 'from', 'duc', 'Goods',
       'Once', 'sold'], dtype=object)

However, running

tokenizer = TFBertForTokenClassification.from_pretrained('bert-base-uncased')
train_encodings = tokenizer(train_texts, is_split_into_words=True, return_offsets_mapping=True, padding=True, truncation=True)

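fails with the error

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float)

I am following the referenced steps for custom datasets. I tried upgrading and downgrading the TensorFlow version, as suggested in other solutions, but without success.

The full error log is given below: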

All model checkpoint layers were used when initializing TFBertForTokenClassification.

Some layers of TFBertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-26-3a20ac537a69> in <module>()
      1 tokenizer = TFBertForTokenClassification.from_pretrained('bert-base-uncased')
----> 2 train_encodings = tokenizer(list(train_texts), is_split_into_words=True, return_offsets_mapping=True, padding=True, truncation=True)
      3 val_encodings = tokenizer(val_texts, is_split_into_words=True, return_offsets_mapping=True, padding=True, truncation=True)

9 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py in __call__(self, *args, **kwargs)
    817           return ops.convert_to_tensor_v2(x)
    818         return x
--> 819       inputs = nest.map_structure(_convert_non_tensor, inputs)
    820       input_list = nest.flatten(inputs)
    821 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/nest.py in map_structure(func, *structure, **kwargs)
    615 
    616   return pack_sequence_as(
--> 617       structure[0], [func(*x) for x in entries],
    618       expand_composites=expand_composites)
    619 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/nest.py in <listcomp>(.0)
    615 
    616   return pack_sequence_as(
--> 617       structure[0], [func(*x) for x in entries],
    618       expand_composites=expand_composites)
    619 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py in _convert_non_tensor(x)
    815         # `SparseTensors` can't be converted to `Tensor`.
    816         if isinstance(x, (np.ndarray, float, int)):
--> 817           return ops.convert_to_tensor_v2(x)
    818         return x
    819       inputs = nest.map_structure(_convert_non_tensor, inputs)

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in convert_to_tensor_v2(value, dtype, dtype_hint, name)
   1281       name=name,
   1282       preferred_dtype=dtype_hint,
-> 1283       as_ref=False)
   1284 
   1285 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
   1339 
   1340     if ret is None:
-> 1341       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
   1342 
   1343     if ret is NotImplemented:

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/tensor_conversion_registry.py in _default_conversion_function(***failed resolving arguments***)
     50 def _default_conversion_function(value, dtype, name, as_ref):
     51   del as_ref  # Unused.
---> 52   return constant_op.constant(value, dtype, name=name)
     53 
     54 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name)
    261   return _constant_impl(value, dtype, shape, name, verify_shape=False,
--> 262                         allow_broadcast=True)
    263 
    264 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    268   ctx = context.context()
    269   if ctx.executing_eagerly():
--> 270     t = convert_to_eager_tensor(value, ctx, dtype)
    271     if shape is None:
    272       return t

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
     94       dtype = dtypes.as_dtype(dtype).as_datatype_enum
     95   ctx.ensure_initialized()
---> 96   return ops.EagerTensor(value, ctx.device_name, dtype)
     97 
     98 

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float).


I think the tokenizer expects a plain Python list of strings. Otherwise, you can try casting the dtype of the NumPy array:

train_texts = train_texts.astype('U')
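It is also worth noting that the traceback in the question passes through Keras `base_layer.py`, and that the object named `tokenizer` was created with `TFBertForTokenClassification.from_pretrained`, so it appears to be the model rather than a tokenizer. The sketch below is only an illustration under those assumptions: `to_word_lists`, `encode`, and `sample_texts` are hypothetical names that do not appear in the question, and `BertTokenizerFast` is assumed to be the intended tokenizer class because `return_offsets_mapping=True` requires a fast tokenizer.

```python
import numpy as np


def to_word_lists(texts):
    """Turn an iterable of word sequences (e.g. NumPy object arrays)
    into plain Python lists of str."""
    return [[str(word) for word in sentence] for sentence in texts]


def encode(texts, model_name='bert-base-uncased'):
    """Encode pre-split sentences with a real tokenizer (not the model).

    The import is local so the helper above can be used without
    transformers installed; calling this downloads the vocabulary.
    """
    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained(model_name)
    return tokenizer(
        to_word_lists(texts),
        is_split_into_words=True,
        return_offsets_mapping=True,
        padding=True,
        truncation=True,
    )


# Hypothetical sample shaped like train_texts in the question,
# including non-string entries that would otherwise trip conversion.
sample_texts = [
    np.array(['TERMS', 'CONDITIONS', 'Delayed', 'payments'], dtype=object),
    np.array(['Goods', 'Once', 'sold', 24, 3.5], dtype=object),
]
clean_texts = to_word_lists(sample_texts)
```

With this in place, `encode(train_texts)` and `encode(val_texts)` would replace the two tokenizer calls from the question, while `TFBertForTokenClassification` is kept for the training step only.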