Python: training an existing spaCy NER pipeline forgets previous examples

I am creating a new model for named entity recognition. I have a training dataset that looks like this:

[[
"world leading global energy trading company has an exciting 
opportunity for a senior csharp application platform developer to 
join its systematic trading division developing new trading 
applications tools and solutions for its successful front office 
trading team", 
{"entities": [[80, 86, "RNK"]]}
]]
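Before training on data like this, it is worth sanity-checking that each `(start, end, label)` offset actually selects the intended substring. The sketch below uses a single-line, illustrative version of the record above with adjusted offsets (the text and the `(78, 84)` span are my own assumption, not the question's exact data, whose offsets depend on its line-wrapped whitespace):

```python
# Sanity-check sketch: confirm each (start, end, label) offset in the
# training data selects the intended substring before training.
TRAIN_DATA = [
    (
        "world leading global energy trading company has an exciting "
        "opportunity for a senior csharp application platform developer",
        {"entities": [(78, 84, "RNK")]},
    ),
]

def entity_spans(text, annotations):
    """Return the (substring, label) pairs the character offsets select."""
    return [(text[start:end], label)
            for start, end, label in annotations["entities"]]

for text, annotations in TRAIN_DATA:
    print(entity_spans(text, annotations))  # → [('senior', 'RNK')]
```

If a printed substring does not match the token you meant to annotate, spaCy will silently learn the wrong span, so this check is cheap insurance.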
I then run the following function to train the model:

import logging
import random
from datetime import datetime
from pathlib import Path

import plac
import spacy
from spacy.util import minibatch, compounding

log = logging.getLogger(__name__)

# TRAIN_DATA is fetched elsewhere, as described above


@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    new_model_name=("New model name for model meta.", "option", "nm", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
    entity=("Name of the entity to be trained", "option", "e", str),
    label=("The label to be given to the trained entity", "option", "l", str),
)
def main(model=None, new_model_name=("ner%s" % str(datetime.now())),
         output_dir=None, n_iter=20, entity=None, label=None):

    if entity is None or label is None:
        log.info("Entity and Label must both be supplied")
        log.info("Bailing out as nothing to do ...... :-(")
        return

    log.info("Fetching training data for entity [%s] to be trained with label [%s]"
             % (entity, label))

    log.info("Training data retrieved and the first row is : ")
    log.info(TRAIN_DATA[0])
    log.info("There are %d rows to be trained" % len(TRAIN_DATA))

    if model is not None:
        nlp = spacy.load(output_dir)  # load the previously saved spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # create the NER pipe if it is missing,
    # otherwise get it so we can add labels to it
    if "ner" not in nlp.pipe_names:
        log.info("ner not in pipe names, adding it in now ....")
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    else:
        log.info("retrieving previous ner pipe now ....")
        ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    move_names = list(ner.move_names)

    # get names of the other pipes, to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names
                   if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        # reset and initialize the weights randomly - but only if we're
        # training a new model
        if model is None:
            optimizer = nlp.begin_training()
        else:
            optimizer = nlp.resume_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    sgd=optimizer,
                    losses=losses,
                )
            print("Losses", losses)
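For reference, the `compounding(4.0, 32.0, 1.001)` batch-size schedule used above can be illustrated in plain Python. The functions below are a simplified sketch of what `spacy.util.compounding` and `spacy.util.minibatch` do, not the library code itself:

```python
import itertools

def compounding(start, stop, compound):
    # Yield start, start*compound, start*compound**2, ... capped at stop
    # (simplified re-implementation of spacy.util.compounding).
    curr = start
    while True:
        yield min(curr, stop)
        curr *= compound

def minibatch(items, size):
    # Split items into batches whose sizes are drawn from the size generator
    # (simplified re-implementation of spacy.util.minibatch).
    items = iter(items)
    while True:
        batch = list(itertools.islice(items, int(next(size))))
        if not batch:
            return
        yield batch

data = list(range(20))
sizes = [len(batch) for batch in minibatch(data, compounding(4.0, 32.0, 1.001))]
print(sizes)  # → [4, 4, 4, 4, 4]
```

With a compound factor of 1.001 the batch size grows very slowly from 4 towards the cap of 32, which is why small datasets like this one are trained almost entirely in batches of 4.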

This works fine, and I am then able to successfully test the model using the following function:

import logging
from pathlib import Path

import plac
import spacy


@plac.annotations(
    model_dir=("Optional output directory", "option", "o", Path),
    test_text=("The test text to be used to test the model", "option", "t", str),
    entity=("Name of the entity to be trained", "option", "e", str),
    label=("The label to be given to the trained entity", "option", "l", str),
)
def main(model_dir=None, test_text=None, entity=None, label=None):

    if entity is None or label is None:
        log.info("Entity and Label must both be supplied")
        log.info("Bailing out as nothing to do ...... :-(")
        return

    if test_text is None:
        test_text = ("Using default test string which is not optimal to look for %s"
                     % entity)

    nlp = spacy.load(model_dir)
    log.info("Loaded model %s" % nlp.meta["name"])
    log.info("Testing the string %s" % test_text)

    ner = nlp.get_pipe("ner")
    for label in ner.labels:
        log.info("NER Label : %s found in model" % label)

    doc = nlp(test_text)
    print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
    print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])


if __name__ == "__main__":
    log = logging.getLogger(__name__)
    log.setLevel(logging.DEBUG)

    consoleHandler = logging.StreamHandler()
    consoleHandler.setLevel(logging.DEBUG)

    log.addHandler(consoleHandler)
    plac.call(main)
However, if I run the code again to train against a new label, say "SKILL", supplying new training data, it successfully loads the old model, retrieves the NER pipe and sets the optimizer to resume, but when I test it again it has forgotten all of its training for the RNK label.

I assumed that resuming would somehow restore the previous model state and keep the previously learned annotations. It certainly keeps the NER labels.
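As I understand it, `resume_training` keeps the model weights and optimizer state but does not rehearse old examples, so epochs containing only the new label can overwrite what was learned for the old one. One common mitigation is to mix previously annotated examples into every epoch. A minimal sketch of that idea, using illustrative data and names (`OLD_TRAIN_DATA` / `NEW_TRAIN_DATA` are hypothetical, not from the question's dataset):

```python
import random

# Illustrative examples only - these records and their offsets are
# hypothetical, not taken from the question's dataset.
OLD_TRAIN_DATA = [("a senior developer", {"entities": [(2, 8, "RNK")]})]
NEW_TRAIN_DATA = [("knows csharp and sql", {"entities": [(6, 12, "SKILL")]})]

def mixed_epoch(old, new, seed=0):
    """Build one shuffled epoch containing both the old and the new
    examples, so every pass rehearses the previously learned label."""
    combined = list(old) + list(new)
    random.Random(seed).shuffle(combined)
    return combined

epoch = mixed_epoch(OLD_TRAIN_DATA, NEW_TRAIN_DATA)
labels = {label for _, ann in epoch for _, _, label in ann["entities"]}
print(sorted(labels))  # → ['RNK', 'SKILL']
```

Feeding `nlp.update` batches drawn from such a mixed epoch means the optimizer never sees a long run of updates for only one label, which is the usual trigger for the forgetting described below.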

Why does this happen?

Thinking that this might be related to the catastrophic forgetting problem, I created a large set of training data that includes examples of both categories, i.e.:

'TOKEN'  >>> 'NER Annotation'
-----------------------------
'senior' >>> 'RNK'
'csharp' >>> 'SKILL'
'sql'    >>> 'SKILL'
These are the losses:

Losses {'ner': 721.8737016180717}
Losses {'ner': 5.999976082008388}
Losses {'ner': 5.970323057037423}
Losses {'ner': 5.996330579093365}
Losses {'ner': 6.028536462566022}
Losses {'ner': 12.043830573641666}
Losses {'ner': 10.001897952651317}
Losses {'ner': 6.016950026187274}
Losses {'ner': 6.624311646328313} 
Losses {'ner': 10.602919933949224}
Losses {'ner': 6.1062697231067995}
Losses {'ner': 8.792055106010444}
Losses {'ner': 13.302123281119345}
Losses {'ner': 6.068028368915684}
Losses {'ner': 8.026694430880903}
Losses {'ner': 8.961434860193798}
Losses {'ner': 6.02721516249698}
Losses {'ner': 9.714660156853073}
Losses {'ner': 4.108544494319015}
Losses {'ner': 6.023105974059858}
Losses {'ner': 7.357760648981275}
Losses {'ner': 6.295292869532734}
Losses {'ner': 3.8088561052881995}
Losses {'ner': 6.059279332644757}
Losses {'ner': 7.024559462190113}
Losses {'ner': 4.784358718788942}
Losses {'ner': 5.935101364429172}
Losses {'ner': 4.027772727507415}
Losses {'ner': 2.1748163004265884}
Losses {'ner': 5.993975825343896}
I have included my training data in a pastebin, which can be found here:


I have also tried training a brand-new model in one go, with all of the data and annotations, before saving it to disk; it made no difference.