Python: Extracting named-entity relations from a trained model


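How can I use spaCy to create a new named entity, "cases", in the context of infectious-disease case counts, and then extract the dependency between that entity and the number of cases?

For example, from the text "Of these, 879 cases with 4 deaths were reported for the period 9 October to 5 November 1995." we would like to extract "879" and "cases".

Following the "Training an additional entity type" code on spaCy's examples documentation page:

I took their existing pretrained English model, "en_core_web_sm", and successfully trained an additional entity called "CASES":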

from __future__ import unicode_literals, print_function

import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding

LABEL = "CASES"

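# results_ent2 is not shown in the question; nlp.update() below expects
# (text, {"entities": [(start, end, label), ...]}) tuples
TRAIN_DATA = results_ent2[0:400]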

def main(model="en_core_web_sm", new_model_name="cases", output_dir='data3', n_iter=30):
    random.seed(0)
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe("ner")

    ner.add_label(LABEL)  # add new entity label to entity recognizer
    # Adding extraneous labels shouldn't mess anything up
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    move_names = list(ner.move_names)
    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        sizes = compounding(1.0, 4.0, 1.001)
        # batch up the examples using spaCy's minibatch
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            batches = minibatch(TRAIN_DATA, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)

    # test the trained model   

    test_text = "There were 100 confirmed cases?"
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta["name"] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        # Check the classes have loaded back consistently
        assert nlp2.get_pipe("ner").move_names == move_names
        doc2 = nlp2(test_text)
        for ent in doc2.ents:
            print(ent.label_, ent.text)

main()

Testing the output:

test_text = 'Of these, 879 cases with 4 deaths were reported for the period 9 October to 5 November 1995. John was infected. It cost $500'
doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
    print(ent.label_, ent.text)

This gives the result:

Entities in 'Of these, 879 cases with 4 deaths were reported for the period 9 October to 5 November 1995. John was infected. It cost $500'
CARDINAL 879
CASES cases
CARDINAL 4
CARDINAL 9
CARDINAL 5
CARDINAL $500

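The model was saved and correctly identifies CASES in the text above.

My goal is to extract the number of cases of a specific disease/virus from a news article, and later the number of deaths as well.

I am now using this newly created model to try to find the dependency between CASES and CARDINAL,

again using spaCy's example

"Training spaCy's dependency parser":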

import plac
import spacy


TEXTS = [
    "Net income was $9.4 million compared to the prior year of $2.7 million. I have 100,000 cases",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
    "Of these, 879 cases with 4 deaths were reported for the period 9 October to 5 November 1995. John was infected. It cost $500"
]


def main(model="data3"):
    nlp = spacy.load(model)
    print("Loaded model '%s'" % model)
    print("Processing %d texts" % len(TEXTS))

    for text in TEXTS:
        doc = nlp(text)
        relations = extract_currency_relations(doc)
        for r1, r2 in relations:
            print("{:<10}\t{}\t{}".format(r1.text, r2.ent_type_, r2.text))


def filter_spans(spans):
    # Filter a sequence of spans so they don't contain overlaps
    # For spaCy 2.1.4+: this function is available as spacy.util.filter_spans()
    get_sort_key = lambda span: (span.end - span.start, -span.start)
    sorted_spans = sorted(spans, key=get_sort_key, reverse=True)
    result = []
    seen_tokens = set()
    for span in sorted_spans:
        # Check for end - 1 here because boundaries are inclusive
        if span.start not in seen_tokens and span.end - 1 not in seen_tokens:
            result.append(span)
        seen_tokens.update(range(span.start, span.end))
    result = sorted(result, key=lambda span: span.start)
    return result


def extract_currency_relations(doc):
    # Merge entities and noun chunks into one token
    spans = list(doc.ents) + list(doc.noun_chunks)
    spans = filter_spans(spans)
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)

    relations = []
    for money in filter(lambda w: w.ent_type_ == "MONEY", doc):
        if money.dep_ in ("attr", "dobj"):
            subject = [w for w in money.head.lefts if w.dep_ == "nsubj"]
            if subject:
                subject = subject[0]
                relations.append((subject, money))
        elif money.dep_ == "pobj" and money.head.dep_ == "prep":
            relations.append((money.head.head, money))
    return relations


main()

If I use the original pretrained model "en_core_web_sm", the result is:

Processing 3 texts
Net income  MONEY   $9.4 million
the prior year  MONEY   $2.7 million
Revenue     MONEY   twelve billion dollars
a loss      MONEY   1b

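This is the same as the model output on spaCy's example page.

Does anyone know what is going on? Why can my new model, which uses transfer learning on top of the original spaCy "en_core_web_sm", no longer find the dependencies in this example?

Edit:

If I use the updated trained model, it detects the new entity "CASES" and the cardinal "100,000", but it loses the ability to detect MONEY and DATE.

When I trained the model, I used thousands of sentences and had the base model en_core_web_sm itself detect and label all entities in them, to keep the model from "forgetting" the old entities.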
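Note that extract_currency_relations only iterates over tokens labelled MONEY, so a model that no longer emits MONEY yields no relations at all. For the stated goal of pairing CARDINAL with CASES, one option is to walk the parse directly. The following is a minimal sketch, assuming the parser attaches the number to "cases" as a `nummod` modifier. The `Tok` class, the `extract_case_counts` name, and the hand-built parse are hypothetical stand-ins so the logic can be checked without a trained model.

```python
class Tok:
    """Hypothetical stand-in exposing the spaCy Token attributes used below."""

    def __init__(self, i, text, dep_, ent_type_="", head=None):
        self.i = i
        self.text = text
        self.dep_ = dep_
        self.ent_type_ = ent_type_
        # Like spaCy, treat a root token as its own head.
        self.head = head if head is not None else self


def extract_case_counts(doc, count_label="CARDINAL", target_label="CASES"):
    """Pair each count token with the target entity it modifies as nummod."""
    pairs = []
    for tok in doc:
        if tok.ent_type_ != count_label or tok.dep_ != "nummod":
            continue
        if tok.head.ent_type_ == target_label:
            pairs.append((tok.text, tok.head.text))
    return pairs


# Hand-built parse of "879 cases with 4 deaths were reported".
# Heads and labels are assumed for illustration, not real model output.
reported = Tok(6, "reported", "ROOT")
cases = Tok(1, "cases", "nsubjpass", "CASES", head=reported)
with_ = Tok(2, "with", "prep", head=cases)
deaths = Tok(4, "deaths", "pobj", head=with_)
toy_doc = [
    Tok(0, "879", "nummod", "CARDINAL", head=cases),
    cases,
    with_,
    Tok(3, "4", "nummod", "CARDINAL", head=deaths),
    deaths,
    Tok(5, "were", "auxpass", head=reported),
    reported,
]

pairs = extract_case_counts(toy_doc)
print(pairs)  # [('879', 'cases')]
```

With a real pipeline, passing `nlp(text)` in place of `toy_doc` should work, since spaCy tokens expose the same `ent_type_`, `dep_`, `head`, and `text` attributes. Merging noun chunks first, as extract_currency_relations does, would probably fold "879 cases" into a single token and hide the relation, so this sketch skips that step.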

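Looking at the original text, as I see it:

Net income was $9.4 million compared to the prior year of $2.7 million. I have 100,000 cases

The pretrained spaCy model correctly returns MONEY, DATE and CARDINAL, which are predefined spaCy entity labels. When you run your custom model data_new, however, you get only CASES and CARDINAL as entity labels, not MONEY and DATE.

The reason is that when you trained the spaCy model on your custom data, you annotated only the text corresponding to CARDINAL and CASES and skipped the other pretrained spaCy labels such as DATE, MONEY, LOC, ORG and NORP. This is how catastrophic forgetting is introduced. It is worth reading up on this concept.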
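One way to keep the old labels in the training data is to merge the base model's predictions with the new CASES annotations before training. Below is a minimal sketch of that merging step on plain character-offset tuples. The helper names `merge_entity_spans` and `span` are hypothetical, and the "base model" spans are written by hand for illustration rather than taken from real en_core_web_sm output.

```python
def merge_entity_spans(base_spans, new_spans):
    """Keep every new span, plus each base span that overlaps none of them."""
    merged = list(new_spans)
    for b_start, b_end, b_label in base_spans:
        clashes = any(b_start < n_end and n_start < b_end
                      for n_start, n_end, _ in new_spans)
        if not clashes:
            merged.append((b_start, b_end, b_label))
    return sorted(merged)


def span(text, phrase, label):
    """Character offsets of the first occurrence of phrase in text."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


text = ("Of these, 879 cases with 4 deaths were reported "
        "for the period 9 October to 5 November 1995.")

# Hypothetical base-model predictions (not real en_core_web_sm output).
base_spans = [
    span(text, "879", "CARDINAL"),
    span(text, "4", "CARDINAL"),
    span(text, "9 October to 5 November 1995", "DATE"),
]
# The new annotation added by hand.
new_spans = [span(text, "cases", "CASES")]

# One training example in the (text, {"entities": [...]}) shape used above.
example = (text, {"entities": merge_entity_spans(base_spans, new_spans)})
print(example[1]["entities"])
```

New spans win over overlapping base spans here because spaCy's NER does not accept overlapping entity annotations; whether that is the right priority depends on the data.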

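My recommendation:

During annotation, you should have balanced labels for MONEY, DATE, CARDINAL, CASES and whatever else you need. Perfect overall balance is not achievable on real data, but try to get as close as you can.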
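A quick way to check that balance is to count how often each label occurs in the training set. This is a minimal sketch, assuming the data is in the (text, {"entities": [...]}) shape used in the training script; `label_counts`, `sample_data` and the `required` set are hypothetical names and values for illustration.

```python
from collections import Counter


def label_counts(train_data):
    """Count entity labels across (text, {"entities": [...]}) examples."""
    counts = Counter()
    for _text, annotations in train_data:
        for _start, _end, label in annotations.get("entities", []):
            counts[label] += 1
    return counts


# Hypothetical annotated examples; offsets are character positions.
sample_data = [
    ("I have 100,000 cases", {"entities": [(7, 14, "CARDINAL"), (15, 20, "CASES")]}),
    ("It cost $500", {"entities": [(9, 12, "MONEY")]}),
    ("John was infected.", {"entities": [(0, 4, "PERSON")]}),
]

counts = label_counts(sample_data)

# Labels the final model should still recognise (an assumed list).
required = {"MONEY", "DATE", "CARDINAL", "CASES"}
missing = sorted(required - set(counts))

print(counts)
print("Labels with no training examples:", missing)
```

Any label that shows up in `missing`, or with a count far below the others, is a candidate for being forgotten during fine-tuning.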
