BERT: modifying the run_squad.py predictions file

python json python-3.x tensorflow

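I am new to BERT and I am trying to edit the output of run_squad.py in order to build a question answering system. I would like to obtain an output file with the following structure: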

{
    "data": [
        {
            "id": "ID1",
            "title": "Alan_Turing",
            "question": "When Alan Turing was born?",
            "context": "Alan Mathison Turing (23 June 1912 – 7 June 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist. [...] . However, both Julius and Ethel wanted their children to be brought up in Britain, so they moved to Maida Vale, London, where Alan Turing was born on 23 June 1912, as recorded by a blue plaque on the outside of the house of his birth, later the Colonnade Hotel. Turing had an elder brother, John (the father of Sir John Dermot Turing, 12th Baronet of the Turing baronets).",
            "answers": [
              {"text": "on 23 June 1912", "probability": 0.891726, "start_logit": 4.075, "end_logit": 4.15},
              {"text": "on 23 June", "probability": 0.091726, "start_logit": 2.075, "end_logit": 1.15},
              {"text": "June 1912", "probability": 0.051726, "start_logit": 1.075, "end_logit": 0.854}
            ]
        },
        {
            "id": "ID2",
            "title": "Title2",
            "question": "Question2",
            "context": "Context 2 ...",
            "answers": [
              {"text": "text1", "probability": 0.891726, "start_logit": 4.075, "end_logit": 4.15},
              {"text": "text2", "probability": 0.091726, "start_logit": 2.075, "end_logit": 1.15},
              {"text": "text3", "probability": 0.051726, "start_logit": 1.075, "end_logit": 0.854}
            ]
        }
    ]
}

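First, BERT's read_squad_examples function (line 227) reads a SQuAD JSON file (the input file) into a list of SquadExample objects. This file contains the first four fields I need: id, title, question and context.

After that, the SquadExample objects are converted to features, and then the write_predictions phase can start (line 741).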

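In write_predictions, BERT writes an output file called nbest_predictions.json, which contains all the candidate answers for a given context together with their associated probabilities.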

On lines 891-898, I think, the last four fields I need (text, probability, start_logit, end_logit) are appended:

# Excerpt from write_predictions; nbest and probs are built earlier in that function.
nbest_json = []
for (i, entry) in enumerate(nbest):
  output = collections.OrderedDict()
  output["text"] = entry.text
  output["probability"] = probs[i]
  output["start_logit"] = entry.start_logit
  output["end_logit"] = entry.end_logit
  nbest_json.append(output)
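The probs list used in this excerpt is not defined in it. As a minimal sketch, assuming probs is a softmax over start_logit + end_logit of the n-best entries (an assumption worth verifying against the copy of run_squad.py in use), the computation could look like the following. The function name softmax and the sample entries are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    if not scores:
        return []
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical n-best entries: (text, start_logit, end_logit).
nbest = [
    ("on 23 June 1912", 4.075, 4.15),
    ("on 23 June", 2.075, 1.15),
    ("June 1912", 1.075, 0.854),
]

# Assumption: each candidate is scored by start_logit + end_logit.
total_scores = [start + end for (_, start, end) in nbest]
probs = softmax(total_scores)

for (text, _, _), prob in zip(nbest, probs):
    print(f"{text!r}: {prob:.4f}")
```

Under this assumption the probabilities of one question's candidates sum to 1, so they are only comparable within the same question id.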

The output file nbest_predictions.json has the following structure:

{
    "ID-1": [
        {
            "text": "text1", 
            "probability": 0.3617, 
            "start_logit": 4.0757, 
            "end_logit": 4.1554
        }, {
            "text": "text2", 
            "probability": 0.0036, 
            "start_logit": -0.5180, 
            "end_logit": 4.1554
        }
    ], 
    "ID-2": [
        {
            "text": "text1", 
            "probability": 0.2487, 
            "start_logit": -1.6009, 
            "end_logit": -0.2818
        }, {
            "text": "text2", 
            "probability": 0.0070, 
            "start_logit": -0.9566, 
            "end_logit": -1.5770
        }
    ]
}
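Since this file is a plain JSON object keyed by question id, it can be inspected outside of BERT. The sketch below parses an inline stand-in for the file (the values are made up) and keeps the k most probable candidates per id. The helper name top_answers is hypothetical and not part of run_squad.py.

```python
import json

# Inline stand-in for the contents of nbest_predictions.json (made-up values).
nbest_text = """
{
    "ID-1": [
        {"text": "text1", "probability": 0.3617, "start_logit": 4.0757, "end_logit": 4.1554},
        {"text": "text2", "probability": 0.0036, "start_logit": -0.5180, "end_logit": 4.1554}
    ],
    "ID-2": [
        {"text": "text1", "probability": 0.2487, "start_logit": -1.6009, "end_logit": -0.2818},
        {"text": "text2", "probability": 0.0070, "start_logit": -0.9566, "end_logit": -1.5770}
    ]
}
"""


def top_answers(nbest, k=1):
    """Return {question_id: top-k candidates sorted by descending probability}."""
    return {
        qid: sorted(cands, key=lambda c: c["probability"], reverse=True)[:k]
        for qid, cands in nbest.items()
    }


nbest = json.loads(nbest_text)
best = top_answers(nbest, k=1)
for qid, cands in best.items():
    print(qid, cands[0]["text"], cands[0]["probability"])
```

With a real run, the same dictionary could be obtained by opening the generated file and calling json.load on it instead of parsing the inline string.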

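Now... I don't quite understand how the nbest_predictions file is generated. How can I edit this function to obtain a JSON file structured as described at the beginning of this post?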

Considering this, I think I have two possibilities:

1. Create a new data structure and append the fields I need to it.

2. Edit the write_predictions function so that nbest_predictions.json is structured the way I want.

Which is the better solution?

For now, I have written a new function that reads the input file and appends the id, title, question and context to a data structure:

import json
import tensorflow as tf


def read_squad_examples2(input_file, is_training):
  """Reads a SQuAD json file into a JSON string of id/title/question/context records."""
  # is_training is unused here; it is kept to mirror read_squad_examples.
  # tf.gfile is the TensorFlow 1.x file API that run_squad.py also uses.
  with tf.gfile.Open(input_file, "r") as reader:
    input_data = json.load(reader)["data"]

  sup_data = []

  for entry in input_data:
    for paragraph in entry["paragraphs"]:
      for qa in paragraph["qas"]:
        # Build a fresh dict per question. Reusing one dict for every
        # append would leave the list full of references to the last record.
        sup_data.append({
            "id": qa["id"],
            "title": entry["title"],
            "question": qa["question"],
            "context": paragraph["context"],
        })

  return json.dumps(sup_data)

What I get is:

[{
    "question": "Question 1?",
    "id": "ID 1 ",
    "context": "The context 1",
    "title": "Title 1"
}, {
    "question": "Question 2?",
    "id": "ID 2 ",
    "context": "The context 2",
    "title": "Title 2"
}]

At this point, how can I append the answers field, containing "text", "probability", "start_logit" and "end_logit", to this data structure?

Thanks in advance.
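One way to approach the first possibility is to join the two structures on the question id after prediction has finished. This is a minimal sketch, not BERT's own method: merge_answers and max_answers are hypothetical names, records stands for the parsed output of read_squad_examples2 (for example via json.loads), and nbest stands for the parsed contents of nbest_predictions.json. It assumes the ids in both structures match exactly.

```python
import json


def merge_answers(records, nbest, max_answers=3):
    """Attach the n-best candidates to each record, keyed by question id."""
    merged = []
    for record in records:
        item = dict(record)  # copy so the input list is left untouched
        # Questions without predictions get an empty answers list.
        item["answers"] = nbest.get(record["id"], [])[:max_answers]
        merged.append(item)
    return {"data": merged}


# Hypothetical inputs shaped like the two structures discussed in the post.
records = [
    {"id": "ID-1", "title": "Title 1", "question": "Question 1?", "context": "The context 1"},
    {"id": "ID-2", "title": "Title 2", "question": "Question 2?", "context": "The context 2"},
]
nbest = {
    "ID-1": [
        {"text": "text1", "probability": 0.3617, "start_logit": 4.0757, "end_logit": 4.1554},
        {"text": "text2", "probability": 0.0036, "start_logit": -0.5180, "end_logit": 4.1554},
    ],
}

result = merge_answers(records, nbest)
print(json.dumps(result, indent=4))
```

Doing the join as a post-processing step leaves write_predictions unchanged, at the cost of reading the input file a second time.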