How to eliminate redundancy in a JSON file using Python


I have a JSON file that looks like this:

{
  "question": "yellow skin around wound from cat bite. why?",
  "answer": "this may be the secondary result of a resolving bruise but a cat bite is a potentially serious and complicated wound and should be under the care of a physician.",
  "tags": [
    "wound care"
  ]
},
{
  "question": "yellow skin around wound from cat bite. why?",
  "answer": "see your doctor with all deliberate speed. or go to an urgent care center or a hospital emergency room. do it fast!",
  "tags": [
    "wound care"
  ]
},

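As you can see, the redundancy is only in the "question" key, while the answers differ. The data was extracted from a forum, so it contains different answers to the same question. Is there a way to use Python to eliminate the redundant part, or to group the answers together?

Thanks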

You can use pandas here:

from io import StringIO

import pandas as pd

a = '''[{
  "question": "yellow skin around wound from cat bite. why?",
  "answer": "this may be the secondary result of a resolving bruise but a cat bite is a potentially serious and complicated wound and should be under the care of a physician.",
  "tags": [
    "wound care"
  ]
},
{
  "question": "yellow skin around wound from cat bite. why?",
  "answer": "see your doctor with all deliberate speed. or go to an urgent care center or a hospital emergency room. do it fast!",
  "tags": [
    "wound care"
  ]
}]'''

# Newer pandas versions deprecate passing a literal JSON string to read_json,
# so wrap the string in a file-like object.
df = pd.read_json(StringIO(a))

# Map each question to the list of all its answers.
df.groupby(['question'])['answer'].apply(list).to_dict()

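Some kind of grouping is needed. There are many ways to do this, including functions from the itertools module, external modules such as pandas, and other sources. Here is one way using a built-in structure, defaultdict: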

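from collections import defaultdict
import json

# rawdata is the JSON text: an array of question/answer objects
data = json.loads(rawdata)
questions = defaultdict(list)
for row in data:
    # pop() removes the 'question' key from the row (this mutates data)
    question = row.pop('question')
    questions[question].append(row)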

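The result will be a dictionary questions (a defaultdict, to be precise), keyed by question, whose values hold the corresponding answers and tags. One downside is that this destructively alters the original parsed JSON data, because row.pop removes the 'question' key from every row. You can work around that in several ways, which I will omit for brevity.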
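One possible way around the destructive pop, sketched here as an illustration and not as the answerer's own fix, is to build a new dictionary for each row. The function name group_by_question and the shortened sample data are hypothetical.

```python
import json
from collections import defaultdict

# Shortened, made-up sample data in the same shape as the question's file.
raw_data = '''[
  {"question": "q1", "answer": "a1", "tags": ["t"]},
  {"question": "q1", "answer": "a2", "tags": ["t"]},
  {"question": "q2", "answer": "a3", "tags": ["u"]}
]'''


def group_by_question(rows):
    """Group rows by their 'question' value without modifying the rows."""
    grouped = defaultdict(list)
    for row in rows:
        # Build a new dict instead of calling row.pop(), so `rows` is untouched.
        # This is a shallow copy: nested lists such as "tags" are still shared.
        rest = {key: value for key, value in row.items() if key != "question"}
        grouped[row["question"]].append(rest)
    return dict(grouped)


data = json.loads(raw_data)
questions = group_by_question(data)
print(json.dumps(questions, indent=2))
```

After this runs, data still contains the "question" key in every row, at the cost of one extra small dictionary per row.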

Here is a simplified version of the resulting questions dictionary:

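{'yellow skin ... why?': [{'answer': 'this may be the secondary result of a '
                                     'resolving bruise but a cat bite is a '
                                     'potentially serious and complicated wound '
                                     'and should be under the care of a '
                                     'physician.',
                           'tags': ['wound care']},
                          {'answer': 'see your doctor with all deliberate '
                                     'speed. or go to an urgent care center or '
                                     'a hospital emergency room. do it fast!',
                           'tags': ['wound care']}]}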
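The itertools route mentioned earlier could look roughly like the sketch below. It assumes the rows fit in memory, because itertools.groupby only groups consecutive items and the rows therefore have to be sorted by question first. The sample rows are made up.

```python
from itertools import groupby
from operator import itemgetter

# Made-up rows; note that the two "q1" rows are not adjacent.
rows = [
    {"question": "q1", "answer": "a1"},
    {"question": "q2", "answer": "a3"},
    {"question": "q1", "answer": "a2"},
]

by_question = itemgetter("question")

# sorted() is stable, so answers keep their original relative order
# within each question.
grouped = {
    question: [row["answer"] for row in group]
    for question, group in groupby(sorted(rows, key=by_question), key=by_question)
}
print(grouped)
```

Compared with defaultdict, this needs an extra sort, so it is mostly useful when the data is already ordered by question.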

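Use the json module to load the file, then go through the items one by one and add them to a dictionary.

What is the expected output for this input? Do you care about the "tags" field? What have you tried? By the way, this snippet is not a valid JSON document. Is it part of a larger JSON, perhaps contained in a JSON array?

jdhesa, it is actually part of a larger JSON file of about 1 million lines. It is the dataset for my project and I am trying to clean it.

This approach does not really eliminate the redundancy.

Why not, HaithamHachicha? You will get all the answers to the same question in one list, keyed by that question in the final dictionary.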
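Putting the comments together, a cleanup pass might look like the sketch below. It rests on several assumptions that the thread does not confirm: that the fragment is a comma-separated run of objects that only needs enclosing brackets to become a JSON array, that exact duplicate answers should be dropped, and that tags should be merged per question. The output field name "answers" is hypothetical.

```python
import json

# Made-up fragment in the same shape as the question: objects separated by
# commas, with a trailing comma and no enclosing array.
fragment = '''
{"question": "q1", "answer": "a1", "tags": ["wound care"]},
{"question": "q1", "answer": "a2", "tags": ["wound care"]},
{"question": "q1", "answer": "a1", "tags": ["wound care"]},
'''

# Assumption: dropping the trailing comma and adding brackets yields valid JSON.
rows = json.loads("[" + fragment.strip().rstrip(",") + "]")

merged = {}
for row in rows:
    # One output record per distinct question text.
    entry = merged.setdefault(
        row["question"],
        {"question": row["question"], "answers": [], "tags": []},
    )
    # Skip answers and tags that were already recorded for this question.
    if row["answer"] not in entry["answers"]:
        entry["answers"].append(row["answer"])
    for tag in row.get("tags", []):
        if tag not in entry["tags"]:
            entry["tags"].append(tag)

cleaned = list(merged.values())
print(json.dumps(cleaned, indent=2))
```

The membership checks scan a list, which is fine when each question has only a handful of answers. For a real file, the same loop would run over the result of json.load on the opened file.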