Python: optimizing the parsing of a file of JSON objects into a DataFrame when some rows may be missing keys

python json performance pandas memory

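I would like to optimize the code below. It takes 5 seconds, which is too slow for a file of only 1000 lines.

I have a large file in which every line contains valid JSON. Each JSON looks like this (the real data is much larger and nested, so I use this JSON snippet for illustration):

I need to parse this file and extract only a few key values from each JSON, to obtain the following resulting DataFrame: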

Groupe      Id   MotherName   FatherName
Advanced    56   Laure         James
Middle      11   Ann           Nicolas
Advanced    6    Helen         Franc

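However, some of the keys I need in the DataFrame are missing from some of the JSON objects, so I have to check whether each key exists and, if it does not, fill the corresponding value with null. I use the following approach: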

import json
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open('path/to/file') as f:
    for chunk in f:
        jfile = json.loads(chunk)

        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan

        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan

        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan

        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan

        # Slow: every append copies the whole frame.
        # Note: DataFrame.append was removed in pandas 2.0.
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
            ignore_index=True)

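I need to optimize the run time for the whole 1000-line file.

The key is not to append each row to the DataFrame inside the loop. Collect the values in a list or dict container instead, and build the DataFrame from all of them in one step. You can also replace the if/else structure with a simple get, which returns a default value (for example np.nan) when the item is not found in the dictionary.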

with open('path/to/file') as f:
    # One list per output column; the keys must match the ones appended to below.
    d = {'groupe': [], 'id': [], 'MotherName': [], 'FatherName': []}
    for chunk in f:
        jfile = json.loads(chunk)
        # .get returns np.nan when the key is missing
        d['groupe'].append(jfile['location'].get('groupe', np.nan))
        d['id'].append(jfile.get('id', np.nan))
        d['MotherName'].append(jfile['Mother'].get('MotherName', np.nan))
        d['FatherName'].append(jfile['Father'].get('FatherName', np.nan))

    # Build the DataFrame once, after the loop
    df = pd.DataFrame(d)
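
As a small self-contained check of the same idea, the sketch below runs the get-with-default pattern on three in-memory lines instead of a file. The sample records are hypothetical (the question does not show its real data); this variant collects one dict per row rather than one list per column, which is an equivalent way to build the frame in a single step.

```python
import io
import json

import numpy as np
import pandas as pd

# Hypothetical sample lines: record 2 has no 'groupe', record 3 has no 'id'
# and no 'FatherName'.
sample = io.StringIO(
    '{"id": 56, "location": {"groupe": "Advanced"}, '
    '"Mother": {"MotherName": "Laure"}, "Father": {"FatherName": "James"}}\n'
    '{"id": 11, "location": {}, '
    '"Mother": {"MotherName": "Ann"}, "Father": {"FatherName": "Nicolas"}}\n'
    '{"location": {"groupe": "Advanced"}, '
    '"Mother": {"MotherName": "Helen"}, "Father": {}}\n'
)

rows = []
for line in sample:
    record = json.loads(line)
    # dict.get returns the second argument when the key is absent
    rows.append({
        "groupe": record["location"].get("groupe", np.nan),
        "id": record.get("id", np.nan),
        "MotherName": record["Mother"].get("MotherName", np.nan),
        "FatherName": record["Father"].get("FatherName", np.nan),
    })

# Build the DataFrame once, from the collected rows
df = pd.DataFrame(rows, columns=["groupe", "id", "MotherName", "FatherName"])
print(df)
```

Because the id column contains a missing value, pandas stores it as floating point; that is a side effect of using np.nan as the placeholder.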

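You get the best performance if you can build the DataFrame in a single step at initialization. DataFrame.from_records takes a sequence of tuples, which you can supply from a generator that reads one record at a time. You can parse the data faster with get, which supplies a default argument when an item is not found. I created an empty dict called dummy to pass to the intermediate gets, so that the chained get is known to work.

I created a 1000-record dataset, and on my crappy laptop the time went from 18 seconds to 0.06 seconds. That's pretty good.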

import numpy as np
import pandas as pd
import json
import time

def extract_data(data):
    """ convert 1 json dict to records for import"""
    dummy = {}
    jfile = json.loads(data.strip())
    return (
        jfile.get('location', dummy).get('groupe', np.nan),
        jfile.get('id', np.nan),
        jfile.get('Mother', dummy).get('MotherName', np.nan),
        jfile.get('Father', dummy).get('FatherName', np.nan))

start = time.time()
with open('file.json') as f:
    # Column order must match the tuple order returned by extract_data
    df = pd.DataFrame.from_records(map(extract_data, f),
        columns=['groupe', 'id', 'MotherName', 'FatherName'])
print('New algorithm', time.time()-start)

#
# The original way (needs pandas < 2.0, where DataFrame.append still exists)
#

start = time.time()
df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open('file.json') as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
            ignore_index=True)
print('original', time.time()-start)

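I get "AttributeError: 'list' object has no attribute 'get'" with this approach! Don't forget that I have a file with JSON on each line, maybe that is the problem. So I need to iterate over the lines to parse each JSON, because the whole file is not JSON itself, but each line of the file is valid JSON.

It works, except where there is nested JSON instead of a dictionary! How do I use the .get method in that case?

I'm not sure what "nested JSON" means. Is it a JSON-encoded string? Maybe you can decode it and replace the string with the decoded structure.

Amanda, please edit your question to add sample data for these corner cases. Parsing problems in JSON are really hard to reproduce... ;-)

Your answer is good, but there is an error when converting the dictionary to a DataFrame: "TypeError: list indices must be integers, not str".

Sounds like there might be a problem with the data. Try creating a DataFrame from each column separately and see whether you can isolate the issue.

You should use Python's built-in lookup with a default for this. It also makes your inner loop code 4x more compact and readable. But you can probably use dict.update or defaultdict to reduce it even further.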
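
The AttributeError and TypeError reported above both arise when an intermediate value is a list (or anything else that is not a dict), so a chained get fails. One possible way to guard against that is sketched below. The helper name safe_get and the sample records are hypothetical, and since the thread never shows the offending data, treating a non-dict intermediate value as "missing" is an assumption rather than a confirmed fix.

```python
import json

import numpy as np
import pandas as pd


def safe_get(obj, *keys, default=np.nan):
    """Follow keys through nested dicts; return default if any step is not a dict or lacks the key."""
    for key in keys:
        if not isinstance(obj, dict):
            return default
        obj = obj.get(key, default)
    return obj


def extract_row(line):
    """Turn one JSON line into a tuple in the output column order."""
    record = json.loads(line)
    return (
        safe_get(record, "location", "groupe"),
        safe_get(record, "id"),
        safe_get(record, "Mother", "MotherName"),
        safe_get(record, "Father", "FatherName"),
    )


# Hypothetical lines: in the second one, 'location' is a list and 'Father' is absent.
lines = [
    '{"id": 56, "location": {"groupe": "Advanced"}, '
    '"Mother": {"MotherName": "Laure"}, "Father": {"FatherName": "James"}}',
    '{"id": 6, "location": ["unexpected", "list"], "Mother": {"MotherName": "Helen"}}',
]

df = pd.DataFrame.from_records(
    [extract_row(line) for line in lines],
    columns=["groupe", "id", "MotherName", "FatherName"],
)
print(df)
```

If the list actually holds the wanted dictionaries, the rows would need to be expanded instead of skipped; that depends on the real data layout.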
