Python: Training a custom model


I have been training my NER model on some text, trying to find cities in it and tag them with custom entity labels.

For example:

    ('paragraph Designated Offices Party A New York Party B Delaware paragraph pricing source calculation Market Value shall generally accepted pricing source reasonably agreed parties paragraph Spot rate Spot Rate specified paragraph reasonably agreed parties',
     # end offsets are exclusive: text[37:45] == 'New York', text[54:62] == 'Delaware'
     {'entities': [(37, 45, 'DesignatedBankLoc'), (54, 62, 'CounterpartyBankLoc')]})

I am looking for two entities here: DesignatedBankLoc and CounterpartyBankLoc. A single text can also contain more than one entity.
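
Since spaCy reads each annotation as character offsets with an exclusive end, it may help to print what every (start, end, label) triple actually selects before training. The following is a minimal sketch, not part of the original post: `check_entity_offsets` is a hypothetical helper, and the sample is a shortened copy of the text above.

```python
def check_entity_offsets(example):
    """Return (label, selected substring) for each annotation in one training example."""
    text, annotations = example
    return [(label, text[start:end]) for start, end, label in annotations["entities"]]


# Shortened copy of the example text; "New York" occupies characters 37..45 (end exclusive).
sample = (
    "paragraph Designated Offices Party A New York Party B Delaware",
    {"entities": [(37, 45, "DesignatedBankLoc"), (54, 62, "CounterpartyBankLoc")]},
)

spans = check_entity_offsets(sample)
for label, surface in spans:
    print(label, repr(surface))

# An end offset of 41 instead of 45 would select only "New " (with a trailing space),
# which does not line up with token boundaries.
truncated = sample[0][37:41]
```

If a printed substring is not exactly the intended city name, the offsets for that example probably need correcting before the data is used for training.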

Currently I am training on 60 rows of data as follows:

import spacy
import random
def train_spacy(data,iterations):
    TRAIN_DATA = data
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        # reuse the existing component so that `ner` is always defined
        ner = nlp.get_pipe('ner')


    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            # ent is a (start, end, label) tuple; register its label
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)
    return nlp


prdnlp = train_spacy(TRAIN_DATA, 100)

My problem is:

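When the input text contains a city that appeared in the training data, the model predicts correctly, whether the text pattern is the same as in training or different. When the text contains a city that never appeared in the training dataset, the model predicts no entity at all, again regardless of whether the text pattern is the same or different.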


Please tell me why this happens, and help me understand conceptually how the model is being trained.

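From experience: you have 60 rows of data and train for 100 iterations. You are overfitting to the values of the entities rather than to their positions, so the model memorizes the city names it has seen instead of learning the context in which a location appears.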

To check this, try inserting city names at random places in sentences and see what happens. If the algorithm tags them, it is probably overfitting.
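
The check described above could be sketched roughly as follows. This is only an illustration under assumptions: `insert_city` is a hypothetical helper, the probe sentence is made up from words in the question, and the model call is left as a comment because it needs the trained `prdnlp` object from the question.

```python
import random


def insert_city(text, city, rng):
    """Insert `city` at a random token boundary.

    Returns the new text and the (start, end) character span of the inserted city.
    """
    tokens = text.split()
    position = rng.randint(0, len(tokens))  # 0 = before first token, len = after last
    tokens.insert(position, city)
    # Characters before the city: the joined preceding tokens plus one separating space.
    start = len(" ".join(tokens[:position])) + (1 if position else 0)
    return " ".join(tokens), (start, start + len(city))


rng = random.Random(0)  # fixed seed so the probe is reproducible
probe_text, (start, end) = insert_city(
    "pricing source reasonably agreed parties", "New York", rng
)
print(probe_text, (start, end))

# With a trained spaCy model (here `prdnlp`, as in the question) one could then inspect:
# print([(ent.text, ent.label_) for ent in prdnlp(probe_text).ents])
```

If a known city is tagged no matter where it lands, the model is most likely keying on the city string itself rather than on the surrounding words.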

There are two solutions:

  • Create more training data with a wider variety of values for these entities
  • Experiment with different numbers of iterations
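
As one possible way to get more varied entity values, existing examples can be reused as templates by swapping in other city names and recomputing the character offsets. The sketch below is an assumption, not something the answer prescribes: `substitute_entities` is a hypothetical helper, the city list is illustrative, and the sample is a shortened copy of the example from the question.

```python
def substitute_entities(text, entities, replacements):
    """Replace each annotated span with a new value and recompute character offsets.

    `replacements` maps an entity label to the new surface string; labels that are
    not in the mapping keep their original text.
    """
    new_text, new_entities, cursor = "", [], 0
    for start, end, label in sorted(entities):
        new_text += text[cursor:start]            # copy the text before the entity
        value = replacements.get(label, text[start:end])
        new_entities.append((len(new_text), len(new_text) + len(value), label))
        new_text += value
        cursor = end
    new_text += text[cursor:]                     # copy whatever follows the last entity
    return new_text, {"entities": new_entities}


text = "paragraph Designated Offices Party A New York Party B Delaware"
entities = [(37, 45, "DesignatedBankLoc"), (54, 62, "CounterpartyBankLoc")]
cities = ["London", "Tokyo", "Frankfurt"]  # illustrative values, not from the original data

augmented = [
    substitute_entities(text, entities, {"DesignatedBankLoc": a, "CounterpartyBankLoc": b})
    for a in cities
    for b in cities
    if a != b
]
print(augmented[0])
```

Each generated pair has the same `(text, {'entities': [...]})` shape as the training examples in the question, so it could be appended to `TRAIN_DATA`. Template substitution only varies the values, though; genuinely new sentences are still needed to vary the context.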

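Thanks for the reply. I would like to know how dropout and the number of iterations affect the model, and how I can check the fit. I tried training models with the same number of iterations but different dropout values, and I get a loss value in both cases. How do I compare the two and see which one works better?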
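
One common way to compare such runs, offered here only as a sketch, is to score each model on held-out examples it never saw during training, since training loss alone says little about generalization. Everything below is an assumption rather than part of the thread: `span_prf` and `spacy_predict` are hypothetical helpers, the dev example is made up, and a stand-in predictor replaces the trained model so the snippet runs on its own.

```python
def span_prf(gold_examples, predict):
    """Micro-averaged precision, recall and F1 over exact (start, end, label) matches."""
    tp = fp = fn = 0
    for text, annotations in gold_examples:
        gold = set(annotations["entities"])
        pred = set(predict(text))
        tp += len(gold & pred)   # predicted and correct
        fp += len(pred - gold)   # predicted but not in the gold annotations
        fn += len(gold - pred)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def spacy_predict(nlp):
    """Adapt a spaCy pipeline to the `predict(text)` interface used by span_prf."""
    return lambda text: [(e.start_char, e.end_char, e.label_) for e in nlp(text).ents]


# Made-up held-out example with cities that are assumed absent from the training data.
dev = [
    ("Party A Boston Party B Austin",
     {"entities": [(8, 14, "DesignatedBankLoc"), (23, 29, "CounterpartyBankLoc")]}),
]

# Stand-in predictor that finds only the first entity, so the sketch runs without a model.
fake_predict = lambda text: [(8, 14, "DesignatedBankLoc")]
scores = span_prf(dev, fake_predict)
print("precision=%.2f recall=%.2f f1=%.2f" % scores)
```

With real models, `span_prf(dev, spacy_predict(model_a))` and `span_prf(dev, spacy_predict(model_b))` could be compared directly. If a run keeps lowering its training loss while its held-out F1 stalls or drops, that usually points to overfitting.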