Python BERT: modifying the run_squad.py predictions file
I am new to BERT and I am trying to edit the output of run_squad.py in order to build a Question Answering system and obtain an output file with the following structure:
{
  "data": [
    {
      "id": "ID1",
      "title": "Alan_Turing",
      "question": "When Alan Turing was born?",
      "context": "Alan Mathison Turing (23 June 1912 – 7 June 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist. [...] . However, both Julius and Ethel wanted their children to be brought up in Britain, so they moved to Maida Vale, London, where Alan Turing was born on 23 June 1912, as recorded by a blue plaque on the outside of the house of his birth, later the Colonnade Hotel. Turing had an elder brother, John (the father of Sir John Dermot Turing, 12th Baronet of the Turing baronets).",
      "answers": [
        {"text": "on 23 June 1912", "probability": 0.891726, "start_logit": 4.075, "end_logit": 4.15},
        {"text": "on 23 June", "probability": 0.091726, "start_logit": 2.075, "end_logit": 1.15},
        {"text": "June 1912", "probability": 0.051726, "start_logit": 1.075, "end_logit": 0.854}
      ]
    },
    {
      "id": "ID2",
      "title": "Title2",
      "question": "Question2",
      "context": "Context 2 ...",
      "answers": [
        {"text": "text1", "probability": 0.891726, "start_logit": 4.075, "end_logit": 4.15},
        {"text": "text2", "probability": 0.091726, "start_logit": 2.075, "end_logit": 1.15},
        {"text": "text3", "probability": 0.051726, "start_logit": 1.075, "end_logit": 0.854}
      ]
    }
  ]
}
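First of all, in BERT's read_squad_examples function (line 227), a SQuAD json file (the input file) is read into a list of SquadExample objects. This file contains the first four fields I need: id, title, question and context.
After that, the SquadExamples are converted to features, and then the write_predictions phase can start (line 741).
In write_predictions, BERT writes an output file called nbest_predictions.json, which contains all the possible answers for a particular context together with the associated probability.
At lines 891-898, I think the last four fields I need (text, probability, start_logit, end_logit) are appended: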
nbest_json = []
for (i, entry) in enumerate(nbest):
    output = collections.OrderedDict()
    output["text"] = entry.text
    output["probability"] = probs[i]
    output["start_logit"] = entry.start_logit
    output["end_logit"] = entry.end_logit
    nbest_json.append(output)
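For context on where probs comes from: in run_squad.py the probabilities are a softmax over the summed start and end logits of all n-best entries for one question. The following is only a minimal standalone sketch of that computation; the function name nbest_probabilities and the plain-dict entries are assumptions made for illustration, not names from BERT.

```python
import math


def nbest_probabilities(entries):
    """Softmax over (start_logit + end_logit) for a list of n-best entries.

    `entries` is assumed to be a list of dicts carrying "start_logit" and
    "end_logit" keys (a hypothetical stand-in for BERT's n-best entries).
    """
    scores = [e["start_logit"] + e["end_logit"] for e in entries]
    if not scores:
        return []
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


# Made-up logits, only to exercise the function.
entries = [
    {"text": "text1", "start_logit": 4.0757, "end_logit": 4.1554},
    {"text": "text2", "start_logit": -0.5180, "end_logit": 4.1554},
]
probs = nbest_probabilities(entries)
```

Because the softmax is taken over the whole n-best list, the probabilities of a list that is later truncated to its first few answers will in general no longer sum to 1.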
The output file nbest_predictions.json has the following structure:
{
  "ID-1": [
    {
      "text": "text1",
      "probability": 0.3617,
      "start_logit": 4.0757,
      "end_logit": 4.1554
    }, {
      "text": "text2",
      "probability": 0.0036,
      "start_logit": -0.5180,
      "end_logit": 4.1554
    }
  ],
  "ID-2": [
    {
      "text": "text1",
      "probability": 0.2487,
      "start_logit": -1.6009,
      "end_logit": -0.2818
    }, {
      "text": "text2",
      "probability": 0.0070,
      "start_logit": -0.9566,
      "end_logit": -1.5770
    }
  ]
}
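Now... I do not quite understand how the nbest_predictions file is generated. How can I edit this function and get a json file structured as I described at the beginning of this post?
Considering this, I think I have two possibilities, one of which is to edit the write_predictions function so that nbest_predictions.json is structured the way I want.
This is the code I have so far:
import json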
import tensorflow as tf


def read_squad_examples2(input_file, is_training):
    # SQUAD json file to a JSON string of flat question records #
    with tf.gfile.Open(input_file, "r") as reader:
        input_data = json.load(reader)["data"]

    def is_whitespace(c):
        if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
            return True
        return False

    sup_data = []
    for entry in input_data:
        entry_title = entry["title"]
        for paragraph in entry["paragraphs"]:
            paragraph_text = paragraph["context"]
            for qa in paragraph["qas"]:
                # Build a fresh dict for every question. Reusing a single
                # dict would make every element of sup_data refer to the
                # same object, i.e. the last record read.
                data = {
                    "title": entry_title,
                    "context": paragraph_text,
                    "id": qa["id"],
                    "question": qa["question"],
                }
                sup_data.append(data)

    my_json = json.dumps(sup_data)
    return my_json
What I get is:
[{
  "question": "Question 1?",
  "id": "ID 1 ",
  "context": "The context 1",
  "title": "Title 1"
}, {
  "question": "Question 2?",
  "id": "ID 2 ",
  "context": "The context 2",
  "title": "Title 2"
}]
At this point, how can I append the answers field, containing "text", "probability", "start_logit" and "end_logit", to this data structure?
Thanks in advance.
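One possible way to join the two structures, given here only as a sketch and not as the way BERT itself does it: since nbest_predictions.json is keyed by question id, each flat question record can look up its own answer list by id. The function name merge_answers, the top_k parameter and the sample data are assumptions made for illustration; the answer lists are assumed to be ordered best first.

```python
import json


def merge_answers(questions, nbest, top_k=3):
    """Attach the first top_k n-best answers to each flat question record.

    questions: list of dicts with an "id" key (e.g. the decoded output of
               read_squad_examples2).
    nbest:     dict mapping question id -> list of answer dicts (e.g. the
               decoded contents of nbest_predictions.json), assumed to be
               ordered best first.
    """
    merged = []
    for q in questions:
        record = dict(q)  # shallow copy so the input list is left untouched
        record["answers"] = nbest.get(q["id"], [])[:top_k]
        merged.append(record)
    return {"data": merged}


# Made-up stand-ins for the two decoded json inputs.
questions = [
    {"id": "ID-1", "title": "Title 1", "question": "Question 1?", "context": "The context 1"},
    {"id": "ID-3", "title": "Title 3", "question": "Question 3?", "context": "The context 3"},
]
nbest = {
    "ID-1": [
        {"text": "text1", "probability": 0.3617, "start_logit": 4.0757, "end_logit": 4.1554},
        {"text": "text2", "probability": 0.0036, "start_logit": -0.5180, "end_logit": 4.1554},
        {"text": "text3", "probability": 0.0010, "start_logit": -1.2, "end_logit": 3.9},
    ],
}
result = merge_answers(questions, nbest, top_k=2)
print(json.dumps(result, indent=2))
```

A question whose id is missing from the predictions gets an empty answers list here; whether that or an error is preferable depends on how the output is consumed.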