How to create nested JSON from a dataframe in Python


I have a dataframe containing Windows 10 logs, and I want to convert it to JSON. What is an efficient way to do this?

I have already generated the default output from the df, but it is not nested. This is what I currently get:

{
    "0": {
        "ProcessName": "Firefox",
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    },
    "1": {
        "ProcessName": "Excel",
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    },
    "2": {
        "ProcessName": "Word",
        "time": "2019-07-12T01:30:00",
        "timeFloat": 1562888000.0,
        "internal_time": 1.5533333333,
        "counter": 0
    }
}
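(For reference, a flat, index-keyed dump like the one above is what pandas produces by default; the snippet below is only a sketch of one way such output can be generated, assuming a dataframe with the column names shown in the sample.)

import json
import pandas as pd

# Hypothetical dataframe with the same columns as the sample above
df = pd.DataFrame([
    {"ProcessName": "Firefox", "time": "2019-07-12T00:00:00",
     "timeFloat": 1562882400.0, "internal_time": 0.0, "counter": 0},
    {"ProcessName": "Excel", "time": "2019-07-12T00:00:00",
     "timeFloat": 1562882400.0, "internal_time": 0.0, "counter": 0},
])

# orient="index" keys every record by its row index ("0", "1", ...)
flat = json.loads(df.to_json(orient="index"))
print(json.dumps(flat, indent=4))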
I would like it to look like this:

{
    "0": {
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "Processes" : {
                     "Firefox" : 0 # ("counter" value),
                     "Excel" : 0 
    },
    "1": ...
}

As far as I understand, you need to group the objects by "time" and merge the counters from the different processes. If so, here is an example implementation:

input_data = {
    "0": {
        "ProcessName": "Firefox",
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    },
    "2": {
        "ProcessName": "ZXC",
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    },
    "3": {
        "ProcessName": "QWE",
        "time": "else_time",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    }
}


def group_input_data_by_time(dict_data):
    """Group records by their "time" value and merge the per-process counters."""
    time_data = {}
    for value_dict in dict_data.values():
        counter = value_dict["counter"]
        process_name = value_dict["ProcessName"]
        time_ = value_dict["time"]
        common_data = {
            "time": time_,
            "timeFloat": value_dict["timeFloat"],
            "internal_time": value_dict["internal_time"],
        }
        # setdefault returns the entry already stored for this time (if any),
        # so all processes sharing a "time" end up in the same nested dict
        common_data = time_data.setdefault(time_, common_data)
        processes = common_data.setdefault("Processes", {})
        processes[process_name] = counter

    # if required to change keys from time to enumerated
    result_dict = {}
    for ind, value in enumerate(time_data.values()):
        result_dict[str(ind)] = value

    return result_dict


print(group_input_data_by_time(input_data))
The result is:

{
    "0": {
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "Processes": {
            "Firefox": 0,
            "ZXC": 0
        }
    },
    "1": {
        "time": "else_time",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "Processes": {
            "QWE": 0
        }
    }
}
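(Note that print on the returned dict shows the Python repr with single quotes; to get actual JSON text formatted as above, serialize the result with the standard json module.)

import json

result = group_input_data_by_time(input_data)
# json.dumps produces real JSON (double quotes) with readable indentation
print(json.dumps(result, indent=4))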

It looks to me like you want to create the JSON from data aggregated on
['time', 'timeFloat', 'internal_time']
which you could do with:

df.groupby(['time', 'timeFloat', 'internal_time'])

However, your example suggests that you want to keep the index keys
("0", "1", etc.), which contradicts that intent.

The aggregated values for a single point in time:

"Firefox" : 0
"Excel" : 0

seem to correspond exactly to those index keys, which would be lost once the aggregation is done.
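(If those original keys do need to survive, one option is to carry them through the grouping explicitly; the following is only a small sketch under that assumption, using a reduced version of the data.)

import pandas as pd

df = pd.DataFrame.from_dict({
    "0": {"ProcessName": "Firefox", "time": "2019-07-12T00:00:00"},
    "1": {"ProcessName": "Excel", "time": "2019-07-12T00:00:00"},
}, orient="index")

# reset_index() turns the "0", "1", ... keys into a regular column named "index",
# which can then be collected per group so it is not lost by the aggregation
keys_per_group = df.reset_index().groupby('time')['index'].agg(list)
print(keys_per_group.to_dict())
# {'2019-07-12T00:00:00': ['0', '1']}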

However, if you do decide to go with aggregation, the code would look like this:

# reading in data:

import pandas as pd
import json
json_data = {
    "0": {
        "ProcessName": "Firefox",
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    },
    "1": {
        "ProcessName": "Excel",
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0,
        "internal_time": 0.0,
        "counter": 0
    },
    "2": {
        "ProcessName": "Word",
        "time": "2019-07-12T01:30:00",
        "timeFloat": 1562888000.0,
        "internal_time": 1.5533333333,
        "counter": 0
}}

df = pd.DataFrame.from_dict(json_data)
df = df.T  # rows keyed by the original "0", "1", ... indices

# processing:
# collapse rows that share the same time columns, collecting the other columns into lists
ddf = df.groupby(['time', 'timeFloat', 'internal_time'], as_index=False).agg(lambda x: list(x))
# zip the per-group process names and counters into a nested dict
ddf['Processes'] = ddf.apply(lambda r: dict(zip(r['ProcessName'], r['counter'])), axis=1)
ddf = ddf.drop(['ProcessName', 'counter'], axis=1)

# printing the result:
json2 = json.loads(ddf.to_json(orient="records"))
print(json.dumps(json2, indent=4, sort_keys=True))
Result:

[
    {
        "Processes": {
            "Excel": 0,
            "Firefox": 0
        },
        "internal_time": 0.0,
        "time": "2019-07-12T00:00:00",
        "timeFloat": 1562882400.0
    },
    {
        "Processes": {
            "Word": 0
        },
        "internal_time": 1.5533333333,
        "time": "2019-07-12T01:30:00",
        "timeFloat": 1562888000.0
    }
]
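(If the enumerated "0", "1", ... keys from the desired output are still wanted, the list of records can simply be re-keyed afterwards; this continues from the snippet above, where json2 is the parsed list of records.)

import json

# turn the list of per-time records into a dict keyed "0", "1", ...
nested = {str(i): record for i, record in enumerate(json2)}
print(json.dumps(nested, indent=4, sort_keys=True))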

This does not take advantage of the fact that the data is already in a dataframe, and it will scale worse than a pandas-based solution as the amount of data grows. As a way of saying thanks, could you please mark one of the answers as accepted (the tick to the left of the answer)?