TensorFlow BERT: tokenizing unknown words

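I am currently working through the following TF tutorial:

While testing the output of the tokenize function on different sentences, I started wondering what happens when unknown words are tokenized.

Loading the model: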

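import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # registers the custom ops the preprocessing model needs

bert_model_name = 'bert_en_uncased_L-12_H-768_A-12'
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3'
tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
bert_preprocess = hub.load(tfhub_handle_preprocess)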

Tokenizing a sentence/words:

tok = bert_preprocess.tokenize(tf.constant(['Tensorsss bla']))
print(tok)

# Output:
<tf.RaggedTensor [[[23435, 4757, 2015], [1038, 2721]]]>

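Shouldn't each word be tokenized as a single token? These are obviously made-up words, but I am wondering what happens when such words are encoded into a fixed-length vector.

Also, how does the tokenizer turn a made-up word into 3 different tokens? Does it split the unknown word into different known pieces?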

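The default cache location for the tensorflow/bert_en_uncased_preprocess/3 model is /tmp/tfhub_modules/602d30248ff7929470db09f7385fc895e9ceb4c0. In its assets directory you can find vocab.txt, which is the vocabulary that was used. You can use that file to find the token corresponding to a token ID i by looking at line i+1 of the file, i.e.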

sed '23436q;d' /tmp/tfhub_modules/602d30248ff7929470db09f7385fc895e9ceb4c0/assets/vocab.txt
> tensor

Doing this for all token IDs returns

[tensor, ##ss, ##s], [b, ##la]
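The same lookup can be done in Python rather than with sed. The following is a minimal sketch, assuming vocab.txt holds one token per line so that the token with ID i sits on line i+1; the helper names load_vocab and ids_to_tokens are made up for illustration and are not part of TensorFlow or TF Hub.

```python
def load_vocab(vocab_path):
    """Read a vocab.txt file into a list so that vocab[i] is the token with ID i."""
    # Assumption: one token per line, token ID i stored on line i+1 (zero-based IDs).
    with open(vocab_path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def ids_to_tokens(token_ids, vocab):
    """Map nested token IDs (one inner list per word) to their vocabulary strings."""
    return [[vocab[i] for i in word_ids] for word_ids in token_ids]


# Hypothetical usage, assuming the module is cached at the default location:
# vocab = load_vocab("/tmp/tfhub_modules/602d30248ff7929470db09f7385fc895e9ceb4c0/assets/vocab.txt")
# print(ids_to_tokens([[23435, 4757, 2015], [1038, 2721]], vocab))
```

The nested list passed to ids_to_tokens mirrors the RaggedTensor printed in the question; tok.to_list()[0] should give the same structure for the first input string.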

As you can see, this confirms your theory that words are split into different known pieces. For more details on the exact algorithm, see
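To make the splitting concrete, here is a rough sketch of greedy longest-match-first subword splitting, the general strategy behind WordPiece-style tokenizers such as BERT's. It is not the code the TF Hub model runs: the function name wordpiece_split and the tiny TOY_VOCAB are invented for illustration, and the input is assumed to be a single word that has already been lowercased (as the uncased model does).

```python
def wordpiece_split(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # non-initial pieces carry the "##" marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No known piece fits at this position, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for this example, not the real vocab.txt).
TOY_VOCAB = {"tensor", "##ss", "##s", "b", "##la"}

print(wordpiece_split("tensorsss", TOY_VOCAB))  # ['tensor', '##ss', '##s']
print(wordpiece_split("bla", TOY_VOCAB))        # ['b', '##la']
print(wordpiece_split("xyz", TOY_VOCAB))        # ['[UNK]']
```

With this toy vocabulary the sketch reproduces the pieces listed above. A word only falls back to the unknown token when no vocabulary piece matches at some position, which is rare with a real vocabulary because it usually contains single characters as pieces.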