Python spaCy annotation tool entity indices
How do I read annotated data in spaCy?
1) The format of my annotated data:
"annotation": [
    [
        79,
        99,
        "Nom complet"
    ],
2) The format of the annotated data that the script expects:
"annotation": [
    {
        "label": [
            "Companies worked at"
        ],
        "points": [
            {
                "start": 1749,
                "end": 1754,
                "text": "Oracle"
            }
        ]
    },
3) How do I change the code so that it can read my annotated data?
import json

training_data = []
for line in lines:
    data = json.loads(line)
    text = data['text']
    entities = []
    for annotation in data['annotation']:
        # only a single point in text annotation.
        point = annotation['points'][0]
        labels = annotation['label']
        # handle both list of labels or a single label.
        if not isinstance(labels, list):
            labels = [labels]
        for label in labels:
            # dataturks indices are both inclusive [start, end] but spacy is not [start, end)
            entities.append((point['start'], point['end'] + 1, label))
    training_data.append((text, {"entities": entities}))
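One way to adapt the loop above to the list-based format in 1) is sketched below. It assumes each input line is a JSON object with a "text" key and an "annotation" list of [start, end, label] triples. The function name convert_list_annotations and the end_inclusive flag are hypothetical, and whether the end index is inclusive depends on the tool that exported the data, so treat this as a starting point rather than a definitive fix.

```python
import json


def convert_list_annotations(lines, end_inclusive=False):
    """Turn JSON lines with [start, end, label] triples into spaCy-style tuples.

    Assumption (not confirmed by the question): every line is a JSON object
    with a "text" key and an "annotation" list of [start, end, label] triples.
    """
    training_data = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        data = json.loads(line)
        entities = []
        for start, end, label in data["annotation"]:
            # spaCy expects end-exclusive character offsets [start, end).
            if end_inclusive:
                end += 1
            entities.append((start, end, label))
        training_data.append((data["text"], {"entities": entities}))
    return training_data


# Made-up sample line, only to show the shape of the input and output.
sample = json.dumps({"text": "Jean Dupont works at Oracle.",
                     "annotation": [[0, 11, "Nom complet"]]})
converted = convert_list_annotations([sample])
print(converted)
```

If the exporting tool uses inclusive end indices, as Dataturks does, pass end_inclusive=True so that each end offset is shifted by one.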
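Training JSON:
[{
    "text": "This Labor Contract (\"Contract\") is effective as of May 12, 2017 (\"Effective Date\") and is entered into by and between Client ABC, Inc. (\"Client-ABC\"), with its principal place of business at 1030 ABC Street, Atlanta, Georgia 30318, USA, and Vendor ABC (\"Vendor\"), with its place of business at 100 Park Avenue, Miami, 10178, USA (hereinafter individually a \"Party\" and collectively the \"Parties\")",
    "entities": [
        [50, 62, "Effective Date"],
        [106, 116, "Vendor Name"],
        [181, 203, "Vendor Address"],
        [205, 212, "Vendor City"],
        [214, 216, "Vendor State"],
        [217, 222, "Vendor Zip Code"],
        [224, 227, "Vendor Country"]
    ]
},{second training data}]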
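Because the entity offsets are character positions into the text, it can help to check them before training. The sketch below uses plain Python only. The helper name check_entities and the sample record are made up for illustration, and the offsets are treated as end-exclusive, which is what spaCy expects.

```python
def check_entities(text, entities):
    """Return a list of problems found in [start, end, label] triples."""
    problems = []
    previous_end = 0
    for start, end, label in sorted(entities, key=lambda e: (e[0], e[1])):
        # The span must lie inside the text and must not be empty.
        if not (0 <= start < end <= len(text)):
            problems.append(f"{label}: span ({start}, {end}) is outside the text")
            continue
        # Overlapping spans cannot both be used as named entities.
        if start < previous_end:
            problems.append(f"{label}: span ({start}, {end}) overlaps an earlier span")
        # Stray whitespace usually means the offsets are shifted by one.
        span_text = text[start:end]
        if span_text != span_text.strip():
            problems.append(f"{label}: span {span_text!r} has leading or trailing whitespace")
        previous_end = max(previous_end, end)
    return problems


# Made-up record, only to exercise the checks.
record = {"text": "Vendor ABC, 100 Park Avenue, Miami",
          "entities": [[0, 10, "Vendor Name"], [12, 27, "Vendor Address"],
                       [28, 34, "Vendor City"], [30, 50, "Vendor Country"]]}
for problem in check_entities(record["text"], record["entities"]):
    print(problem)
```

Printing text[start:end] for each triple is often the quickest way to see whether the indices in a training file are inclusive or exclusive.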
Custom training code:
import json
import random

from spacy.gold import GoldParse  # spaCy v2 API

# nlp, ner and n_iter are assumed to be defined earlier in the script.
training_pickel_file = "training_pickel_file.json"
with open(training_pickel_file) as f:
    TRAIN_DATA = json.load(f)

# register every entity label with the NER component
for annotations in TRAIN_DATA:
    for ent in annotations["entities"]:
        ner.add_label(ent[2])

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):  # only train NER
    optimizer = nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for a in TRAIN_DATA:
            doc = nlp.make_doc(a["text"])
            gold = GoldParse(doc, entities=a["entities"])
            nlp.update([doc], [gold], drop=0.5, sgd=optimizer, losses=losses)
        print('Losses', losses)