Loading a saved NER model back into a HuggingFace pipeline?
Tags: nlp, named-entity-recognition, huggingface-transformers, huggingface-tokenizers

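I am exploring HuggingFace's transfer learning functionality (specifically for named entity recognition). As a preface, I am fairly new to transformer architectures. I briefly went through the example on their website: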

from transformers import pipeline

nlp = pipeline("ner")

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very " \
       "close to the Manhattan Bridge which is visible from the window."

print(nlp(sequence))

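What I want to do is save this locally and run it without downloading the "ner" model every time (it is over 1 GB). In their documentation I saw that the pipeline can be saved to a local folder with the "pipeline.save_pretrained()" function. The result is a set of files stored in a folder of my choosing.

My question is: how do I load this model back into a script so that, after saving, I can keep classifying as in the example above? The output of "pipeline.save_pretrained()" is multiple files.

Here is what I have tried so far:

1: Following the documentation on pipelines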

pipe = transformers.TokenClassificationPipeline(model="pytorch_model.bin", tokenizer='tokenizer_config.json')

The error I get is: 'str' object has no attribute 'config'

2: Following the HuggingFace example on NER:

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("path to folder following .save_pretrained()")
tokenizer = AutoTokenizer.from_pretrained("path to folder following .save_pretrained()")

label_list = [
"O",       # Outside of a named entity
"B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
"I-MISC",  # Miscellaneous entity
"B-PER",   # Beginning of a person's name right after another person's name
"I-PER",   # Person's name
"B-ORG",   # Beginning of an organisation right after another organisation
"I-ORG",   # Organisation
"B-LOC",   # Beginning of a location right after another location
"I-LOC"    # Location
]

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very " \
       "close to the Manhattan Bridge."

# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")

outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])

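This produces an error: list index out of range

I also tried printing just the predictions, but that does not return the tokens and their entities in text form.

Any help would be much appreciated.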
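
One possible cause of the "list index out of range" error in attempt 2 (an assumption, not confirmed anywhere in this thread) is that the loaded model predicts a class id that the hard-coded label_list does not cover. Transformers models carry their own id-to-label mapping in model.config.id2label, so reading labels from there avoids relying on a hand-written list. The sketch below shows only the lookup step. The helper name label_tokens and the small example mapping are hypothetical stand-ins, and in a real script id2label would be model.config.id2label.

```python
def label_tokens(tokens, prediction_ids, id2label):
    """Pair each token with its label name without indexing past the label set."""
    labelled = []
    for token, pred in zip(tokens, prediction_ids):
        # dict.get returns a placeholder instead of raising for an unknown class id
        labelled.append((token, id2label.get(pred, f"UNKNOWN_{pred}")))
    return labelled


# Hypothetical stand-in for model.config.id2label (integer id -> label string)
example_id2label = {0: "O", 1: "B-PER", 2: "I-PER"}

# Class id 5 is deliberately outside the mapping to show the fallback
example = label_tokens(["[CLS]", "Hugging", "Face"], [0, 2, 5], example_id2label)
print(example)
```

In the script from attempt 2, the equivalent call would be label_tokens(tokens, predictions[0].tolist(), model.config.id2label). Any "UNKNOWN_" entries in the result would point to a mismatch between the model's output size and its label mapping.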

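Regarding your first attempt: model and tokenizer are not single files. Both should be the folder that contains the output of save_pretrained.

Did you ever solve this? I am also trying to load the pipeline "in one go", but I cannot find any documentation on it.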
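
Following the comment above about passing the folder rather than individual files, a minimal sketch of reloading the pipeline from disk might look like this. It assumes the folder was written by "pipeline.save_pretrained()" and therefore contains both the model files (including config.json) and the tokenizer files. The helper names is_saved_model_dir and load_ner_pipeline are hypothetical, and the transformers import sits inside the function only so that the folder check can run on its own.

```python
from pathlib import Path


def is_saved_model_dir(model_dir):
    """True if model_dir is a folder holding the config.json written by save_pretrained()."""
    folder = Path(model_dir)
    return folder.is_dir() and (folder / "config.json").is_file()


def load_ner_pipeline(model_dir):
    """Rebuild a NER pipeline from a local save_pretrained() folder."""
    if not is_saved_model_dir(model_dir):
        # Catches the mistake from attempt 1: a file name such as "pytorch_model.bin"
        raise FileNotFoundError(f"{model_dir!r} is not a folder containing config.json")

    # Imported lazily so the folder check above does not require transformers
    from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

    # Both from_pretrained calls take the folder path, not an individual file
    model = AutoModelForTokenClassification.from_pretrained(model_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)

    # The pipeline factory accepts already-loaded model and tokenizer objects
    return pipeline("ner", model=model, tokenizer=tokenizer)
```

With a folder saved earlier, for example a hypothetical "./ner_model", usage would be nlp = load_ner_pipeline("./ner_model") followed by print(nlp(sequence)), as in the original example. Passing loaded objects rather than strings also avoids the 'str' object has no attribute 'config' error from attempt 1, because TokenClassificationPipeline expects a model object, not a file name.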