Python: deeply nested JSON response to a DataFrame

python json pandas

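I am new to Python and pandas, and I am having trouble converting nested JSON into a pandas DataFrame. I send a query to a database and get a JSON string back.

It is a deeply nested JSON string containing several arrays. The response from the database contains thousands of rows. Here is the general structure of the rows in the JSON string (two rows shown):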

{
  "ID": "123456",
  "profile": {
    "criteria": [
      {
        "type": "type1",
        "name": "name1",
        "value": "7",
        "properties": []
      },
      {
        "type": "type2",
        "name": "name2",
        "value": "6",
        "properties": [
          {
            "type": "MAX",
            "name": "",
            "value": "100"
          },
          {
            "type": "MIN",
            "name": "",
            "value": "5"
          }
        ]
      },
      {
        "type": "type3",
        "name": "name3",
        "value": "5",
        "properties": []
      }
    ]
  }
}  
{
  "ID": "456789",
  "profile": {
    "criteria": [
      {
        "type": "type4",
        "name": "name4",
        "value": "6",
        "properties": []
      }
    ]
  }
}

I want to flatten this JSON string using Python. I am having trouble with json_normalize, because the JSON string is very deeply nested:

from cassandra.cluster import Cluster
import pandas as pd
from pandas.io.json import json_normalize

def pandas_factory(colnames, rows):
    return pd.DataFrame(rows, columns=colnames)

cluster = Cluster(['xxx.xx.x.xx'], port=yyyy)
session = cluster.connect('nnnn')

session.row_factory = pandas_factory

json_string = session.execute('select json ......')
df = json_string ._current_rows
df_normalized= json_normalize(df)
print(df_normalized)

When I run this code, I get a KeyError:

KeyError: 0

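I need help converting this JSON string into a DataFrame that contains only a few selected columns, looking like this: (the rest of the data can be skipped)

I tried to find similar questions here, but I cannot seem to apply the answers to my JSON string.

Thanks for any help! :)

Edit:

The returned JSON string is actually a query response object: ResultSet. I think that is why I run into problems when using: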

json_string= session.execute('select json profile from visning')
temp = json.loads(json_string)

and get the error:

TypeError: the JSON object must be str, not 'ResultSet'
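The error message suggests that json.loads was handed the whole ResultSet rather than a string. A minimal sketch of parsing one row at a time is shown below. The Row namedtuple and the fake_result_set list are hypothetical stand-ins for what the driver returns, so the example runs without a database; it also assumes the default row factory (rows with a json attribute), not the pandas_factory set earlier.

```python
import json
from collections import namedtuple

# Hypothetical stand-in for the driver's row type: one column named "json".
Row = namedtuple("Row", ["json"])

# Hypothetical stand-in for the ResultSet returned by session.execute(...).
fake_result_set = [
    Row(json='{"ID": "123", "profile": {"criteria": []}}'),
    Row(json='{"ID": null, "profile": null}'),
]

# json.loads handles one JSON string at a time, so parse row by row.
records = [json.loads(row.json) for row in fake_result_set]
print(records)
```

Each element of records is then an ordinary Python dict, with JSON null mapped to None.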

Edit #2:

To see what I am working with, I printed the query result using:

for line in session.execute('select json.....'):
    print(line)

and got this result:

Row(json='{"ID": null, "profile": null}')
Row(json='{"ID": "123", "profile": {"criteria": [{"type": "type1", "name": "name1", "value": "10", "properties": []}, {"type": "type2", "name": "name2", "value": "50", "properties": []}, {"type": "type3", "name": "name3", "value": "40", "properties": []}]}}')
Row(json='{"ID": "456", "profile": {"criteria": []}}')
Row(json='{"ID": "789", "profile": {"criteria": [{"type": "type4", "name": "name4", "value": "5", "properties": []}]}}')
Row(json='{"ID": "987", "profile": {"criteria": [{"type": "type5", "name": "name5", "value": "70", "properties": []}, {"type": "type6", "name": "name6", "value": "60", "properties": []}, {"type": "type7", "name": "name7", "value": "2", "properties": []}, {"type": "type8", "name": "name8", "value": "7", "properties": []}]}}')

The problem I have is converting this structure into a JSON string that can be used with json.loads():

This gives the output:

{"ID": null, "profile": null}
{"ID": "123", "profile": {"criteria": [{"type": "type1", "name": "name1", "value": "10", "properties": []}, {"type": "type2", "name": "name2", "value": "50", "properties": []}, {"type": "type3", "name": "name3", "value": "40", "properties": []}]}}
{"ID": "456", "profile": {"criteria": []}}
{"ID": "789", "profile": {"criteria": [{"type": "type4", "name": "name4", "value": "5", "properties": []}]}}
{"ID": "987", "profile": {"criteria": [{"type": "type5", "name": "name5", "value": "70", "properties": []}, {"type": "type6", "name": "name6", "value": "60", "properties": []}, {"type": "type7", "name": "name7", "value": "2", "properties": []}, {"type": "type8", "name": "name8", "value": "7", "properties": []}]}}

A hard-coded example:

import pandas as pd

temp = [{
  "ID": "123456",
  "profile": {
    "criteria": [
      {
        "type": "type1",
        "name": "name1",
        "value": "7",
        "properties": []
      },
      {
        "type": "type2",
        "name": "name2",
        "value": "6",
        "properties": [
          {
            "type": "MAX",
            "name": "",
            "value": "100"
          },
          {
            "type": "MIN",
            "name": "",
            "value": "5"
          }
        ]
      },
      {
        "type": "type3",
        "name": "name3",
        "value": "5",
        "properties": []
      }
    ]
  }
},
{
  "ID": "456789",
  "profile": {
    "criteria": [
      {
        "type": "type4",
        "name": "name4",
        "value": "6",
        "properties": []
      }
    ]
  }
}]

cols = ['ID', 'criteria', 'type', 'name', 'value']

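rows = []
for data in temp:
    data_id = data['ID']
    criteria = data['profile']['criteria']
    # Number the criteria from 1. enumerate avoids the wrong position that
    # criteria.index(d) would report when two entries are identical.
    for i, d in enumerate(criteria, start=1):
        # Pick the fields by name rather than relying on key order.
        rows.append([data_id, i, d['type'], d['name'], d['value']])

df = pd.DataFrame(rows, columns=cols)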

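This is not elegant at all. It is more of a quick and dirty solution, since I do not know exactly how the JSON data is formatted. Based on what you have provided, though, the code above produces the desired DataFrame: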

       ID  criteria   type   name value
0  123456         1  type1  name1     7
1  123456         2  type2  name2     6
2  123456         3  type3  name3     5
3  456789         1  type4  name4     6
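As an alternative to the explicit loops, pandas can do the flattening itself through pd.json_normalize (available at the top level in pandas 1.0 and later). This is a sketch rather than part of the original answer. The records list is a shortened, hand-written copy of the sample data, and it assumes every record has a non-null profile; records with a null profile are not handled here and would need to be filtered out first.

```python
import pandas as pd

# Two shortened records in the same shape as the sample data.
records = [
    {"ID": "123456", "profile": {"criteria": [
        {"type": "type1", "name": "name1", "value": "7", "properties": []},
        {"type": "type2", "name": "name2", "value": "6", "properties": [
            {"type": "MAX", "name": "", "value": "100"},
        ]},
    ]}},
    {"ID": "456789", "profile": {"criteria": [
        {"type": "type4", "name": "name4", "value": "6", "properties": []},
    ]}},
]

# One output row per element of profile.criteria, with ID carried along.
df = pd.json_normalize(records, record_path=["profile", "criteria"], meta=["ID"])

# Number the criteria within each ID, starting at 1.
df["criteria"] = df.groupby("ID").cumcount() + 1

# Keep only the selected columns, in the desired order.
df = df[["ID", "criteria", "type", "name", "value"]]
print(df)
```

The record_path argument names the path to the list that should become rows, and meta lists the outer fields to repeat on each of those rows.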

Also, if you need to "load" the JSON data, you can use the json library, like so:

import json

temp = json.loads(json_string)

# Or from a file...
with open('some_json.json') as json_file:
    temp = json.load(json_file)
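To make the two calls concrete, here is a small sketch: json.loads parses a string, while json.load reads from a file-like object. The sample text is made up for illustration, and io.StringIO stands in for an open file so the example needs no file on disk.

```python
import io
import json

# Made-up sample document for illustration.
text = '{"ID": "456", "profile": {"criteria": []}}'

# json.loads parses a JSON document held in a string.
from_string = json.loads(text)

# json.load reads from a file-like object; StringIO stands in for an open file.
from_file = json.load(io.StringIO(text))

print(from_string == from_file)
```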

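Note the difference between json.load (reads from a file object) and json.loads (parses a string).

Can you provide the JSON for at least two rows?

I updated the question and added another row.

@stovfl I added the output of print((line.json)), see Edit #3.

Your output shows that line.json gives you the JSON without any further processing.

@stovfl I am still having problems using json.loads on this JSON, see the code added in Edit #3.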