Python transformers and BERT: dealing with possessives and apostrophes when encoding

Consider the two sentences: "why isn't Alex's text tokenizing? The house on the left is the Smiths' house"
Now let's tokenize and decode:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokenizer.decode(tokenizer.convert_tokens_to_ids(tokenizer.tokenize("why isn't Alex's text tokenizing? The house on the left is the Smiths' house")))
We get:
"why isn't alex's text tokenizing? the house on the left is the smiths'house"
My question is: how can I deal with the missing space in possessives like smiths'house?
To me, the tokenization process in transformers does not seem quite right. Consider the output of:
tokenizer.tokenize("why isn't Alex's text tokenizing? The house on the left is the Smiths' house")
We get:
['why', 'isn', "'", 't', 'alex', "'", 's', 'text', 'token', '##izing', '?', 'the', 'house', 'on', 'the', 'left', 'is', 'the', 'smith', '##s', "'", 'house']
So at this step we have already lost important information about the last apostrophe. It would be better if the tokenization were done differently:
['why', 'isn', "##'", '##t', 'alex', "##'", '##s', 'text', 'token', '##izing', '?', 'the', 'house', 'on', 'the', 'left', 'is', 'the', 'smith', '##s', "##'", 'house']
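This alternative scheme could be approximated by post-processing the standard tokenizer output against the original text. The helper below, `mark_glued_tokens`, is my own hypothetical sketch (not a transformers API): it prefixes "##" to any letter or apostrophe token that starts exactly where the previous token ended, so the apostrophe's attachment to the preceding word survives:

```python
# Hypothetical post-processing sketch (mark_glued_tokens is my own helper,
# not part of transformers): walk the original text and re-mark tokens
# that are glued to the previous token with WordPiece's "##" prefix.
def mark_glued_tokens(text, tokens):
    text = text.lower()  # match bert-base-uncased's lowercased tokens
    marked, pos, prev_end = [], 0, None
    for tok in tokens:
        piece = tok[2:] if tok.startswith("##") else tok
        start = text.find(piece, pos)
        glued = prev_end is not None and start == prev_end
        # Only re-mark letters/digits and apostrophes; other punctuation
        # such as '?' keeps its plain form, as in the proposal above.
        marked.append("##" + piece if glued and (piece == "'" or piece.isalnum()) else piece)
        pos = start + len(piece)
        prev_end = pos
    return marked

text = "why isn't Alex's text tokenizing? The house on the left is the Smiths' house"
tokens = ['why', 'isn', "'", 't', 'alex', "'", 's', 'text', 'token', '##izing', '?',
          'the', 'house', 'on', 'the', 'left', 'is', 'the', 'smith', '##s', "'", 'house']
print(mark_glued_tokens(text, tokens))
# ['why', 'isn', "##'", '##t', 'alex', "##'", '##s', 'text', 'token', '##izing', '?',
#  'the', 'house', 'on', 'the', 'left', 'is', 'the', 'smith', '##s', "##'", 'house']
```

Note that the "##'" tokens produced here are not in BERT's pretrained vocabulary, so this only helps for losslessly reconstructing the text, not for feeding IDs to the pretrained model as-is.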
This way, the tokenization keeps all the information about the apostrophes, and we would not run into problems with possessives.

Why do you have exactly the same example text sentence as this question? Are you the same user?

@stackoverflowuser2010 No. I just saw that question, and I have my own question.