Python transformers and BERT: dealing with possessives and apostrophes when encoding

Consider the two sentences: "why isn't Alex's text tokenizing? The house on the left is the Smiths' house"
Now let's tokenize and decode:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokenizer.decode(tokenizer.convert_tokens_to_ids(tokenizer.tokenize("why isn't Alex's text tokenizing? The house on the left is the Smiths' house")))
We get:
"why isn't alex's text tokenizing? the house on the left is the smiths'house"
My question is: how can I deal with the missing space in possessives like smiths'house?
To me, the tokenization process in transformers does not seem quite right. Consider the output of:
tokenizer.tokenize("why isn't Alex's text tokenizing? The house on the left is the Smiths' house")
We get:
['why', 'isn', "'", 't', 'alex', "'", 's', 'text', 'token', '##izing', '?', 'the', 'house', 'on', 'the', 'left', 'is', 'the', 'smith', '##s', "'", 'house']
So at this step we have already lost important information about the last apostrophe. It would be better if the tokenization were done differently:
['why', 'isn', "##'", '##t', 'alex', "##'", '##s', 'text', 'token', '##izing', '?', 'the', 'house', 'on', 'the', 'left', 'is', 'the', 'smith', '##s', "##'", 'house']
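This alternative scheme could be approximated by post-processing the standard tokenizer output against the original text. The helper below, `mark_glued_tokens`, is my own hypothetical sketch (not a transformers API): it prefixes "##" to any letter or apostrophe token that starts exactly where the previous token ended, so the apostrophe's attachment to the preceding word survives:

```python
# Hypothetical post-processing sketch (mark_glued_tokens is my own helper,
# not part of transformers): walk the original text and re-mark tokens
# that are glued to the previous token with WordPiece's "##" prefix.
def mark_glued_tokens(text, tokens):
    text = text.lower()  # match bert-base-uncased's lowercased tokens
    marked, pos, prev_end = [], 0, None
    for tok in tokens:
        piece = tok[2:] if tok.startswith("##") else tok
        start = text.find(piece, pos)
        glued = prev_end is not None and start == prev_end
        # Only re-mark letters/digits and apostrophes; other punctuation
        # such as '?' keeps its plain form, as in the proposal above.
        marked.append("##" + piece if glued and (piece == "'" or piece.isalnum()) else piece)
        pos = start + len(piece)
        prev_end = pos
    return marked

text = "why isn't Alex's text tokenizing? The house on the left is the Smiths' house"
tokens = ['why', 'isn', "'", 't', 'alex', "'", 's', 'text', 'token', '##izing', '?',
          'the', 'house', 'on', 'the', 'left', 'is', 'the', 'smith', '##s', "'", 'house']
print(mark_glued_tokens(text, tokens))
# ['why', 'isn', "##'", '##t', 'alex', "##'", '##s', 'text', 'token', '##izing', '?',
#  'the', 'house', 'on', 'the', 'left', 'is', 'the', 'smith', '##s', "##'", 'house']
```

Note that the "##'" tokens produced here are not in BERT's pretrained vocabulary, so this only helps for losslessly reconstructing the text, not for feeding IDs to the pretrained model as-is.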
This way, the tokenization keeps all the information about the apostrophes, and we would not run into problems with possessives.

Why do you have exactly the same example text sentence as this question? Are you the same user?

@stackoverflowuser2010 No. I just saw that question, and I have my own question.