How to convert combined spaCy NER tags to BIO format?

python python-3.x nlp

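How can I convert this to BIO format? I tried spaCy's biluo_tags_from_offsets, but it does not capture all of the entities, and I think I know why.

tags = biluo_tags_from_offsets(doc, annot['entities'])

In BSc(Bachelor of Science) the two entities are written together with no space between them, but spaCy splits the text where there is whitespace. The tokens therefore come out as BSc(Bachelor, of, Science), which is why biluo_tags_from_offsets fails and returns - for them.

When it checks (80, 83, 'Degree'), it cannot find BSc as a token on its own. It fails again in the same way for (84, 103, 'Degree').

How can I fix cases like these? Any help would be appreciated.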

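In general, you pass your data to biluo_tags_from_offsets(doc, entities), where entities looks like [(14, 44, 'ORG'), (51, 54, 'ORG')]. You can edit this argument however you need (you can start from doc.ents and go on from there). You can add, remove, or merge any entities in this list, as in the following example:
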
import spacy

# The helper was renamed in spaCy v3; fall back to the v2 location used in the original answer.
try:
    from spacy.training import offsets_to_biluo_tags as biluo_tags_from_offsets  # spaCy v3+
except ImportError:
    from spacy.gold import biluo_tags_from_offsets  # spaCy v2

nlp = spacy.load("en_core_web_md")

text = "I have a BSc (Bachelors of Computer Sciences) from NYU"
doc = nlp(text)
print("Entities before adding new entity:", doc.ents)

# Collect the model's predictions as (start_char, end_char, label) tuples
entities = []
for ent in doc.ents:
    entities.append((ent.start_char, ent.end_char, ent.label_))
print("BILUO before adding new entity:", biluo_tags_from_offsets(doc, entities))

entities.append((9, 12, 'ORG'))  # add a desired entity: characters 9-12 are "BSc"

print("BILUO after adding new entity:", biluo_tags_from_offsets(doc, entities))

Output:

Entities before adding new entity: (Bachelors of Computer Sciences, NYU)
BILUO before adding new entity: ['O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'U-ORG']
BILUO after adding new entity: ['O', 'O', 'O', 'U-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'U-ORG']
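
The question asks for BIO rather than BILUO. Going from one to the other only needs a prefix change: L- becomes I- and U- becomes B-. Below is a minimal sketch of that mapping in plain Python; the helper name biluo_to_bio is made up for illustration and is not part of the original answer.

```python
def biluo_to_bio(tags):
    """Map BILUO tags to BIO: L- becomes I-, U- becomes B-, all other tags are kept."""
    prefix_map = {"L": "I", "U": "B"}
    bio = []
    for tag in tags:
        # Only rewrite tags of the form "<prefix>-<label>"; "O" and the
        # misalignment marker "-" are passed through unchanged.
        if len(tag) > 1 and tag[1] == "-" and tag[0] in prefix_map:
            bio.append(prefix_map[tag[0]] + tag[1:])
        else:
            bio.append(tag)
    return bio


# The BILUO tags printed by the snippet above, after adding the new entity
biluo = ['O', 'O', 'O', 'U-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'U-ORG']
bio_tags = biluo_to_bio(biluo)
print(bio_tags)
```

Tokens that spaCy could not align keep their - tag, so they stay visible after the conversion and can be inspected separately.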

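If you want the merging of entities to be rule-based, you can try the simplified EntityRuler example shown further down (taken from the link above). Then pass the redefined (in your case, merged) entity list to biluo_tags_from_offsets again, as in the first code snippet.

Comments:

Could you try merging the tokens with Doc.retokenize(), as shown in the link? It would be interesting to see whether the pretrained model still recognizes the newly merged tokens.

@SergeyBushmanov Could you provide a working example? I could not follow that link properly. What exactly does retokenize() do?

@SergeyBushmanov I read online that spaCy does not support overlapping entities. Is there any way to work around these problems? I cannot find any good articles or working examples on how to solve them. If you are familiar with this, please help.

You can also check this.

@SergeyBushmanov I read that while researching. But in my case the overlapping entities have two different labels. How do I merge two entities onto one word? I cannot work out how to build an NER model around that. If you are familiar with the workflow, please help me solve it. I have been stuck on this for weeks. My dataset has two problems: the one I described above, and overlapping entities.

@UsReS12 Does this answer your question? Was it helpful? Please consider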

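# Rule-based entities with EntityRuler (spaCy v2 API, as in the original answer)
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
# A pattern is either an exact string or a list of per-token attribute dicts
patterns = [{"label": "ORG", "pattern": "Apple"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
ruler.add_patterns(patterns)
# In spaCy v3, replace the EntityRuler(nlp) and add_pipe(ruler) lines with:
#   ruler = nlp.add_pipe("entity_ruler")   (before calling ruler.add_patterns)
nlp.add_pipe(ruler)

doc = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])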
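
The misalignment described in the question disappears once punctuation is split into tokens of its own, so that BSc and Bachelor of Science each line up with token boundaries. The following is a plain-Python sketch of that idea, independent of spaCy: the regex tokenizer, the function names, and the sample sentence with its offsets are all assumptions made for illustration, and the question's real text and offsets (80, 83 and 84, 103) are not reproduced here.

```python
import re


def tokenize_with_offsets(text):
    """Split text into word and punctuation tokens, keeping character offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def offsets_to_bio(text, entities):
    """Tag every token that lies fully inside an entity span with B-/I- labels."""
    tokens = tokenize_with_offsets(text)
    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        inside = [i for i, (_, start, end) in enumerate(tokens)
                  if start >= ent_start and end <= ent_end]
        for position, token_index in enumerate(inside):
            prefix = "B-" if position == 0 else "I-"
            tags[token_index] = prefix + label
    return [token for token, _, _ in tokens], tags


# Hypothetical sentence: "BSc" and "Bachelor of Science" touch the parentheses
text = "I hold a BSc(Bachelor of Science) degree"
entities = [(9, 12, "Degree"), (13, 32, "Degree")]

tokens, tags = offsets_to_bio(text, entities)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Because ( and ) become separate tokens here, both spans map cleanly onto whole tokens. This sketch does not handle overlapping spans: if two entities cover the same token, the later one in the list overwrites the earlier one.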