Python: training a custom NER model
I have been training my NER model on some text, trying to find the cities in it as custom entities. For example:
('paragraph Designated Offices Party A New York Party B Delaware paragraph pricing source calculation Market Value shall generally accepted pricing source reasonably agreed parties paragraph Spot rate Spot Rate specified paragraph reasonably agreed parties',
{'entities': [(37, 41, 'DesignatedBankLoc'),(54, 62, 'CounterpartyBankLoc')]})
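A quick way to confirm that each (start, end) pair covers the intended text is to slice the string with it. The following is a minimal sketch that uses only the first part of the sample above; `show_spans` is a hypothetical helper name, not part of spaCy.

```python
def show_spans(text, annotations):
    """Return (label, covered_text) for every annotated span."""
    return [(label, text[start:end])
            for start, end, label in annotations["entities"]]

# First part of the sample text from the question (truncated for brevity).
text = ("paragraph Designated Offices Party A New York Party B Delaware "
        "paragraph pricing source")
annotations = {"entities": [(37, 41, "DesignatedBankLoc"),
                            (54, 62, "CounterpartyBankLoc")]}

for label, covered in show_spans(text, annotations):
    print(label, repr(covered))
```

On this sample the first span, (37, 41), slices out 'New ' rather than 'New York'; ending it at 45 would cover the full city name. The second span covers 'Delaware' as intended.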
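Here I am looking for two entities, DesignatedBankLoc and CounterpartyBankLoc. A single text can also contain multiple entities.
Currently I am training on 60 rows of data as follows: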
import random

import spacy


def train_spacy(data, iterations):
    TRAIN_DATA = data
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    # (this is the spaCy v2-style API)
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe('ner')
    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)  # accumulated loss for this iteration
    return nlp


prdnlp = train_spacy(TRAIN_DATA, 100)
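My problem is:
- When the input text, whether it follows the same pattern or a different one, contains cities that were seen during training, the model predicts correctly.
- The model predicts no entities at all when the text, whether it follows the same pattern or a different one, contains cities that never appeared in the training dataset.
Please tell me why this happens, and help me understand the concept of how the model gets trained.

Speaking from experience: you have 60 rows of data and train for 100 iterations. You are overfitting to the values of the entities rather than to their positions.
To check this, try inserting city names at random places in a sentence and see what happens. If the algorithm tags them, it is probably overfitting.
There are two solutions: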
- Create more training data, with more varied values for these entities
- Test different numbers of iterations
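
The overfitting check and the first suggestion can both be sketched in plain Python. This is only an illustration under stated assumptions: `substitute_entities` and `insert_at_random_position` are hypothetical helper names, the entity spans are assumed not to overlap, and the cities London, Tokyo and Chicago are made-up values that do not come from the original data.

```python
import random


def substitute_entities(text, entities, new_values):
    """Swap each labelled span for new_values[label] and recompute offsets.

    Assumes the spans in `entities` do not overlap.
    """
    pieces, new_entities = [], []
    cursor = 0   # position in the original text
    length = 0   # length of the text emitted so far
    for start, end, label in sorted(entities):
        value = new_values.get(label, text[start:end])
        before = text[cursor:start]
        pieces.append(before)
        length += len(before)
        pieces.append(value)
        new_entities.append((length, length + len(value), label))
        length += len(value)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), {"entities": new_entities}


def insert_at_random_position(text, value, rng=random):
    """Insert `value` at a random word boundary, for probing the model."""
    words = text.split()
    pos = rng.randint(0, len(words))
    return " ".join(words[:pos] + [value] + words[pos:])


# Shortened example in the same format as the training data.
text = "Party A New York Party B Delaware"
entities = [(8, 16, "DesignatedBankLoc"), (25, 33, "CounterpartyBankLoc")]

# More varied training data: same pattern, different (made-up) city values.
new_text, new_ann = substitute_entities(
    text, entities,
    {"DesignatedBankLoc": "London", "CounterpartyBankLoc": "Tokyo"})
print(new_text, new_ann)

# Overfitting probe: a city name dropped into an unrelated position.
probe = insert_at_random_position(
    "pricing source reasonably agreed parties", "Chicago", random.Random(0))
print(probe)
```

Feeding `probe` to the trained model and looking at `doc.ents` shows whether the city is tagged regardless of context. The pairs produced by `substitute_entities` can be appended to the training data so the model sees the same positions with different values.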