Python spaCy machine-learning model only partially captures longer entities. How can I fix this?


I have trained a spaCy model on my own data, starting from the pre-existing en_core_web_sm-2.2.0 model. Some entities in my data are only partially captured by the trained model:

texts = ['KOYA MOTORS PRIVATE LTD.', 'KOYAL MOTORS PRIVATE LTD.', 'PUTTAR MOTORS LIMITED',
         'BRENSON MOTORS LIMITED', 'MITASHI LIMITED',
         'FEDERATION OF KARNATAKA CHAMBERS OF COMMERCE & INDUSTRY']
for text in texts:
    print("#####################")
    doc = nlp_trained(text)
    print(text, doc.ents)
    print("##")
    for token in doc:
        print(token, token.ent_iob_, token.ent_type_, token.pos_, token.tag_,
              token.head, token.lang_, token.lemma_)
Output:

#####################
KOYA MOTORS PRIVATE LTD. (MOTORS PRIVATE LTD.,)
##
KOYA O  PROPN NNP LTD en KOYA
MOTORS B ORG PROPN NNP LTD en MOTORS
PRIVATE I ORG PROPN NNP LTD en PRIVATE
LTD I ORG PROPN NNP LTD en LTD
. I ORG PUNCT . LTD en .
#####################
KOYAL MOTORS PRIVATE LTD. (KOYAL MOTORS PRIVATE LTD.,)
##
KOYAL B ORG PROPN NNP LTD en KOYAL
MOTORS I ORG PROPN NNP LTD en MOTORS
PRIVATE I ORG PROPN NNP LTD en PRIVATE
LTD I ORG PROPN NNP LTD en LTD
. I ORG PUNCT . LTD en .
#####################
PUTTAR MOTORS LIMITED (MOTORS LIMITED,)
##
PUTTAR O  NOUN NN LIMITED en puttar
MOTORS B ORG PROPN NNP LIMITED en MOTORS
LIMITED I ORG PROPN NNP LIMITED en LIMITED
#####################
BRENSON MOTORS LIMITED (BRENSON MOTORS LIMITED,)
##
BRENSON B ORG PROPN NNP LIMITED en BRENSON
MOTORS I ORG PROPN NNP LIMITED en MOTORS
LIMITED I ORG PROPN NNP LIMITED en LIMITED
#####################
MITASHI LIMITED ()
##
MITASHI O  PROPN NNP MITASHI en MITASHI
LIMITED O  PROPN NNP MITASHI en LIMITED
#####################
FEDERATION OF KARNATAKA CHAMBERS OF COMMERCE & INDUSTRY (KARNATAKA CHAMBERS OF COMMERCE & INDUSTRY,)
##
FEDERATION O  NOUN NN FEDERATION en federation
OF O  ADP IN FEDERATION en of
KARNATAKA B ORG PROPN NNP CHAMBERS en KARNATAKA
CHAMBERS I ORG NOUN NNS OF en chamber
OF I ORG ADP IN CHAMBERS en of
COMMERCE I ORG PROPN NNP OF en COMMERCE
& I ORG CCONJ CC COMMERCE en &
INDUSTRY I ORG PROPN NNP COMMERCE en INDUSTRY
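The token-level output above uses the IOB scheme: the doc-level entity is whatever contiguous run of B/I tags the model predicts, so any leading token the model tags O (KOYA, PUTTAR, FEDERATION OF) is simply dropped from the span. A small plain-Python illustration of how spans are reassembled from IOB tags (the tokens and tags are copied from the first example above; joining with spaces is an approximation, since spaCy preserves the original spacing):

```python
# Reassemble entity spans from per-token IOB tags, as spaCy does for doc.ents.
tokens = ["KOYA", "MOTORS", "PRIVATE", "LTD", "."]
iob    = ["O",    "B",      "I",       "I",   "I"]   # the model's prediction

def spans_from_iob(tokens, iob):
    spans, current = [], []
    for tok, tag in zip(tokens, iob):
        if tag == "B":               # a new entity begins here
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif tag == "I" and current:
            current.append(tok)      # continue the open entity
        else:                        # "O": close any open entity
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

print(spans_from_iob(tokens, iob))  # ['MOTORS PRIVATE LTD .']
```

Because "KOYA" is tagged O, it can never become part of the span, no matter how the remaining tags are decoded.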

What is the likely cause of this, and how can I correct it?

spaCy's en_core_web_sm-2.2.0 model was not trained on words such as KOYAL and KOYA. One way to get the model to predict these words is to update the en_core_web_sm-2.2.0 model with your own annotated examples.

You can find more information in spaCy's documentation on training the named entity recognizer.

The code should look something like this:

import random
import spacy
from spacy.gold import GoldParse
from cytoolz import partition_all

# load the model to be updated
nlp = spacy.load('en_core_web_sm')

# training data; character offsets are end-exclusive and must exactly cover
# the entity text, e.g. (9, 19) spans "ICICI bank"
TRAIN_DATA = [
    ("Where is ICICI bank located", {"entities": [(9, 19, "ORG")]}),
    ("I like Thodupuzha and Pala", {"entities": [(7, 17, "LOC"), (22, 26, "LOC")]}),
    ("Thodupuzha is a tourist place", {"entities": [(0, 10, "LOC")]}),
    ("Pala is famous for mangoes", {"entities": [(0, 4, "LOC")]}),
    ("ICICI bank is one of the largest banks in the world", {"entities": [(0, 10, "ORG")]}),
    ("ICICI bank has a branch in Thodupuzha", {"entities": [(0, 10, "ORG"), (27, 37, "LOC")]}),
]
# revision data: the current model's own predictions, mixed into training
# to reduce catastrophic forgetting of what the model already knows
revision_data = []
for doc in nlp.pipe(list(zip(*TRAIN_DATA))[0]):
    tags = [w.tag_ for w in doc]
    heads = [w.head.i for w in doc]
    deps = [w.dep_ for w in doc]
    entities = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
    revision_data.append((doc, GoldParse(doc, tags=tags, heads=heads,
                                         deps=deps, entities=entities)))
# preparing the fine_tune_data
fine_tune_data = []
for raw_text, entity_offsets in TRAIN_DATA:
    doc = nlp.make_doc(raw_text)
    gold = GoldParse(doc, entities=entity_offsets['entities'])
    fine_tune_data.append((doc, gold))
# training the model
n_epoch = 10
batch_size = 2
for i in range(n_epoch):
    examples = revision_data + fine_tune_data
    losses = {}
    random.shuffle(examples)
    for batch in partition_all(batch_size, examples):
        docs, golds = zip(*batch)
        nlp.update(docs, golds, drop=0.0, losses=losses)
# testing NER with the updated model
doc = nlp("KOYA MOTORS PRIVATE LTD. is located in Thodupuzha")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
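Misaligned character offsets are a common cause of partially learned entities: if an annotated span does not line up with token boundaries, spaCy silently discards it during training, so the model only ever sees the fragments that do align. A minimal, hedged sanity check you can run over your annotations before training (plain Python, no spaCy required; `check_offsets` is a hypothetical helper, and checking against whitespace is only a rough proxy for real token alignment):

```python
# Sanity-check that each annotated span slices out whole words: a span that
# starts or ends mid-word will not align with spaCy's tokens.
TRAIN_DATA = [
    ("Where is ICICI bank located", {"entities": [(9, 19, "ORG")]}),
    ("Pala is famous for mangoes", {"entities": [(0, 4, "LOC")]}),
]

def check_offsets(train_data):
    problems = []
    for text, ann in train_data:
        for start, end, label in ann["entities"]:
            span = text[start:end]
            # the character before the span, if any, should be whitespace
            bad_start = start > 0 and text[start - 1] not in " \t"
            # the character after the span, if any, should be whitespace or punctuation
            bad_end = end < len(text) and text[end] not in " \t.,"
            if bad_start or bad_end or span != span.strip():
                problems.append((text, start, end, span, label))
    return problems

print(check_offsets(TRAIN_DATA))  # an empty list means the offsets look aligned
```

For example, annotating "Pala" as (0, 3) would slice out only "Pal" and be flagged here, and spaCy would skip that entity during training.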

Comments:

- Where do you see the partial capture? Is it in nlp_trained(text).ents?
- Yes. In "KOYA MOTORS PRIVATE LTD. (MOTORS PRIVATE LTD.,)", the part in parentheses is the captured entity, so what gets captured is "MOTORS PRIVATE LTD." rather than "KOYA MOTORS PRIVATE LTD.".
- I have already trained my model using the method above, which is why I mentioned that I trained it on my data. I would like to understand what steps I should take to improve it further, and what this kind of output implies about the model and the training actions I should take.
- How large is your training data?