Python spaCy annotation tool entity indices


python json python-3.x


How do I read annotated data in spaCy?

1) The format of my annotated data:

  "annotation": [
    [
      79,
      99,
      "Nom complet"
    ],

2) The format of the annotated data that the script expects:

  "annotation": [
    {
      "label": [
        "Companies worked at"
      ],
      "points": [
        {
          "start": 1749,
          "end": 1754,
          "text": "Oracle"
        }
      ]
    },

3) How can I change the code below so that it reads my annotated data?

import json

training_data = []
# `lines` holds one JSON document per line of the exported annotation file
for line in lines:
    data = json.loads(line)
    text = data['text']
    entities = []
    for annotation in data['annotation']:
        # only a single point in text annotation.
        point = annotation['points'][0]
        labels = annotation['label']
        # handle both list of labels or a single label.
        if not isinstance(labels, list):
            labels = [labels]

        for label in labels:
            # dataturks indices are both inclusive [start, end] but spacy is not [start, end)
            entities.append((point['start'], point['end'] + 1, label))

    training_data.append((text, {"entities": entities}))
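For the list-style annotations in format 1, one possible adaptation is sketched below. It is not the original script: the helper name `triples_to_spacy`, the `end_inclusive` flag, and the sample record are all hypothetical, and it assumes each record also carries a `"text"` key as in the loop above. Whether the `end` value in your export is inclusive or exclusive is something to check against your own data; the flag covers both cases.

```python
import json


def triples_to_spacy(record, end_inclusive=False):
    """Convert one record whose annotations are [start, end, label] triples
    into spaCy's (text, {"entities": [...]}) training tuple.

    end_inclusive is an assumption switch: set it to True if the export
    uses inclusive end offsets, since spaCy expects exclusive ends.
    """
    entities = []
    for start, end, label in record["annotation"]:
        if end_inclusive:
            end += 1
        entities.append((start, end, label))
    return (record["text"], {"entities": entities})


# Hypothetical one-line record, made up for illustration only.
line = '{"text": "Oracle hired Jane Doe.", "annotation": [[13, 21, "Nom complet"]]}'
example = triples_to_spacy(json.loads(line))
print(example)
```

Inside the original loop this would replace the `points`/`label` handling: call the helper on `data` and append its result to `training_data`.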

Training JSON:

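[{
  "text": "This Labor Contract (\"Contract\") is effective as of May 12, 2017 (\"Effective Date\"), by and between Client-ABC, Inc. (\"Client-ABC\"), having its principal place of business at 1030 Client-ABC Street, Atlanta, GA 30318, USA, and Vendor-ABC (\"Vendor\"), having a place of business at 100 Park Avenue, Miami, 10178, USA (hereinafter referred to individually as \"Party\" and collectively as \"Parties\"),",
  "entities": [
    [50, 62, "Effective Date"],
    [106, 116, "Vendor Name"],
    [181, 203, "Vendor Address"],
    [205, 212, "Vendor City"],
    [214, 216, "Vendor State"],
    [217, 222, "Vendor Zipcode"],
    [224, 227, "Vendor Country"]
  ]
}, {second training data}]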
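Because the question turns on entity indices, it can help to print or check what each `[start, end, label]` span actually covers before training. The sketch below is a plain-Python sanity check, not part of spaCy; `check_spans` and the sample strings are hypothetical. It flags spans that fall outside the text, overlap an earlier span, or include stray whitespace, which are common reasons for spans failing to line up with token boundaries.

```python
def check_spans(text, entities):
    """Return a list of human-readable problems found in [start, end, label] spans.

    Offsets are treated as character positions with an exclusive end,
    which is the convention spaCy uses.
    """
    problems = []
    previous_end = 0
    for start, end, label in sorted(entities, key=lambda e: (e[0], e[1])):
        if not (0 <= start < end <= len(text)):
            problems.append(f"{label}: span ({start}, {end}) is out of range")
            continue
        if start < previous_end:
            problems.append(f"{label}: span ({start}, {end}) overlaps the previous span")
        covered = text[start:end]
        if covered != covered.strip():
            problems.append(f"{label}: span ({start}, {end}) has leading or trailing whitespace")
        previous_end = max(previous_end, end)
    return problems


# Made-up sample text and spans, for illustration only.
sample_text = "Effective as of May 12, 2017 in Miami"
good = [[16, 28, "Effective Date"], [32, 37, "Vendor City"]]
bad = [[15, 28, "Effective Date"], [20, 40, "Vendor City"]]

print(check_spans(sample_text, good))
for problem in check_spans(sample_text, bad):
    print(problem)
```

Running it over every item in the training JSON, with `item["text"]` and `item["entities"]`, gives a quick list of spans worth inspecting by hand.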

Custom training code:

import json
import random

from spacy.gold import GoldParse  # spaCy 2.x

# `nlp`, `ner` and `n_iter` are defined earlier in the script
training_pickel_file = "training_pickel_file.json"
with open(training_pickel_file) as f:
    TRAIN_DATA = json.load(f)

# register every entity label with the NER component
for annotations in TRAIN_DATA:
    for ent in annotations["entities"]:
        ner.add_label(ent[2])

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):  # only train NER
    optimizer = nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for a in TRAIN_DATA:
            doc = nlp.make_doc(a["text"])
            gold = GoldParse(doc, entities=a["entities"])
            nlp.update([doc], [gold], drop=0.5, sgd=optimizer, losses=losses)
        print('Losses', losses)