How to insert specific keys from a JSON file into a data frame in Python


python json pandas dataframe


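Apologies if this is trivial or has already been asked. I'm new to Python and to working with JSON files, so I'm quite confused.

I have a 9GB JSON file that was scraped from a website. The data contains information on roughly 3 million individuals. Each individual has attributes, but not all individuals have the same attributes. An attribute corresponds to a key in the JSON file, like this: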

{
  "_id": "in-00000001",
  "name": {
    "family_name": "Trump",
    "given_name": "Donald"
  },
  "locality": "United States",
  "skills": [
    "Twitter",
    "Real Estate",
    "Golf"
     ],
  "industry": "Government",
  "experience": [
  {
    "org": "Republican",
    "end": "Present",
    "start": "January 2017",
    "title": "President of the United States"
  },
  {
    "org": "The Apprentice",
    "end": "2015",
    "start": "2003",
    "title": "The guy that fires people"
  }]
}

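So here, _id, name, locality, skills, industry and experience are all attributes (keys). Another profile may have additional attributes, such as education, awards or interests, or may be missing an attribute found in another profile, such as the skills attribute, and so on.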

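What I'd like to do is scan every profile in the JSON file and, if a profile contains the attributes skills, industry and experience, extract that information and insert it into a data frame (I suppose I need Pandas for this?). From experience, I specifically want to extract the name of their current employer, i.e. the most recent listing under org. The data frame would look like this: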

    Industry   | Current employer | Skills
    ___________________________________________________________________
    Government | Republican       | Twitter, Real Estate, Golf
    Marketing  | Marketers R Us   | Branding, Social Media, Advertising

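... and so on for all profiles that have these three attributes.

I'm struggling to find a good resource that explains how to do this kind of thing, hence my question.

I suppose rough pseudocode would be: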

for each profile in open(path to .json file):
    if profile has keys "experience", "industry" AND "skills":
        on the same row of the data frame:
            insert current employer into "current employer" column of 
            data frame
            insert industry into "industry" column of data frame
            insert list of skills into "skills" column of data frame

I just need to know how to write this in Python.

I'm assuming the file contains all the profiles, such as

{
    "profile 1" : {
        # Full object as in the example above
    },
    "profile 2" : {
        #Full object as in the example above
    }
}

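Before continuing, let me show how to use a DataFrame properly.

An example of better DataFrame usage: each cell in a DataFrame should hold a single value rather than a list. So we have to duplicate rows, one row per skill, as shown in the example below. See this question and JD Long's answer for more details.

Look for the explanations in the comments of the code below: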

import json
import pandas as pd

# Create an empty DataFrame df with the columns as in the example
df = pd.DataFrame(columns=['ID', 'Industry', 'Employer', 'Skill'])

# Placeholder: replace with the path to your .json file
path = 'profiles.json'

# Load the file as JSON.
with open(path, encoding='utf-8') as file:
    # json.load() reads the whole file and parses it into a dict
    obj = json.load(file)

# Then iterate over its items() as key-value pairs.
# The loop below depends on my first assumption;
# depending on the file format, it might have to differ.
for prof_key, profile in obj.items():
    # Verify that the profile contains all the required keys
    if all(key in profile for key in ('_id', 'experience', 'industry', 'skills')):
        # Employers of the positions whose "end" is "Present"
        current = [x['org'] for x in profile['experience'] if x.get('end') == 'Present']
        if not current:
            continue  # no current position listed, skip this profile
        for skill in profile['skills']:
            # len(df) is the next unused index label, so this appends a row
            df.loc[len(df)] = [profile['_id'], profile['industry'], current[0], skill]

The line df.loc[len(df)] = ... above inserts a row as the last row of the DataFrame, because len(df) is always the next unused index label.

When you want to use this information later, you will have to use df.groupby('ID').
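As a rough illustration of that groupby step, the sketch below collapses the one-row-per-skill layout back into one row per profile, similar to the table in the question. The second profile ID and its rows are made up for the example, and named aggregation assumes a reasonably recent pandas (0.25 or later).

```python
import pandas as pd

# One row per (profile, skill), as produced by the loop above.
# "in-00000002" is a hypothetical second profile used only for illustration.
df = pd.DataFrame(
    [
        ["in-00000001", "Government", "Republican", "Twitter"],
        ["in-00000001", "Government", "Republican", "Real Estate"],
        ["in-00000001", "Government", "Republican", "Golf"],
        ["in-00000002", "Marketing", "Marketers R Us", "Branding"],
        ["in-00000002", "Marketing", "Marketers R Us", "Social Media"],
    ],
    columns=["ID", "Industry", "Employer", "Skill"],
)

# Group the rows of each profile and join its skills into one string.
summary = (
    df.groupby("ID", sort=False)
    .agg(
        Industry=("Industry", "first"),
        Employer=("Employer", "first"),
        Skills=("Skill", ", ".join),
    )
    .reset_index()
)
print(summary)
```

Keeping the long format for storage and only grouping when a per-profile view is needed makes it easy to filter or count by individual skill.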

Let me know if your file has a different format, and whether this explanation is enough to get you started or you need more.

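I have edited my answer several times. Let me know if it helps.

@Atterson Sure, thank you for taking the time to answer my question. It's already quite late where I live, so I'll read it through carefully tomorrow and let you know how it goes.

Hi, it seems the problem with this solution is that my JSON file is 9GB, so I get a memory error when executing this line: json.loads(file.readlines()). Is there a way to modify the code so that it reads part of the file, works with it, and then repeats on another part of the file? To answer your question about valid values for start and end: essentially, anything considered a "current position" will have an end value of "Present". If the end value is not "Present", it can be ignored. Is that enough information?

Looking at the documentation for readlines(), I can see that this function takes a "sizehint" as input, i.e. the number of bytes to read from the file. So would it be possible to read, say, the first 500MB, process it, then read the next 500MB, ... and so on?

After trying to read the first 50000000 bytes, I got the following error:

obj = json.loads(file.readlines(50000000))  # 50MB
  File "C:\Users\Jake\AppData\Local\Programs\Python\Python37-32\lib\json\__init__.py", line 341, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not list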
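A small sketch of what is probably going on with that error, using an in-memory stand-in for the real file (the sample text is made up): readlines() returns a list of strings, while json.loads() expects a single str. Joining the lines fixes the type, but a size-limited slice of one large JSON document is generally not valid JSON by itself, so it still cannot be parsed.

```python
import io
import json

# Stand-in for the real file: one small JSON document spread over several lines.
text = '{\n  "a": {"industry": "Government"},\n  "b": {"industry": "Marketing"}\n}\n'

# readlines() returns a list of lines, which json.loads() rejects.
lines = io.StringIO(text).readlines()
try:
    json.loads(lines)
    type_error = None
except TypeError as exc:
    type_error = str(exc)
print(type_error)

# Joining the lines gives a single str that json.loads() accepts.
obj = json.loads("".join(lines))

# A size-limited read stops partway through the document,
# so the joined text is an incomplete JSON value.
partial = "".join(io.StringIO(text).readlines(10))
try:
    json.loads(partial)
    partial_ok = True
except json.JSONDecodeError:
    partial_ok = False
print(partial_ok)
```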

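OK, you can use readline to read a single line. However, parsing the file is not that simple. You have to know what to read, when to stop, and what to parse into JSON each time... Also, I'm not sure what you are trying to do. :) Your file format is... If you get a bytearray, you have to decode/encode it, or just use str(). Another approach is read(nbytes).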
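One way the read(nbytes) idea could look, as a minimal sketch rather than a definitive solution. It assumes, unlike the single-dict layout assumed in the answer, that the file is a sequence of complete profile objects written one after another (for example one per line), which is common for scraped dumps. The helper names iter_profiles, current_employer and extract_rows are made up for this example. If the file really is one enormous JSON object, an incremental parser such as the third-party ijson package would be needed instead.

```python
import io
import json

REQUIRED = ("experience", "industry", "skills")


def iter_profiles(file, chunk_size=1 << 20):
    """Yield consecutive top-level JSON objects from a text file object."""
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = file.read(chunk_size)
        buffer += chunk
        pos = 0
        while True:
            # Skip whitespace (for example newlines) between objects.
            while pos < len(buffer) and buffer[pos].isspace():
                pos += 1
            if pos == len(buffer):
                break
            try:
                obj, pos = decoder.raw_decode(buffer, pos)
            except json.JSONDecodeError:
                break  # the object is incomplete; read another chunk
            yield obj
        buffer = buffer[pos:]  # keep only the unparsed tail
        if not chunk:  # end of file reached
            if buffer:
                raise ValueError("input ends with incomplete JSON")
            return


def current_employer(profile):
    """Return the org of the first position whose end is "Present", else None."""
    for job in profile["experience"]:
        if job.get("end") == "Present":
            return job.get("org")
    return None


def extract_rows(profiles):
    """Yield one dict per profile that has all required keys and a current job."""
    for profile in profiles:
        if not all(key in profile for key in REQUIRED):
            continue
        employer = current_employer(profile)
        if employer is None:
            continue
        yield {
            "Industry": profile["industry"],
            "Current employer": employer,
            "Skills": ", ".join(profile["skills"]),
        }


# Tiny in-memory stand-in for the 9GB file; the second profile is hypothetical.
sample = "\n".join(json.dumps(p) for p in [
    {"_id": "in-00000001", "industry": "Government",
     "skills": ["Twitter", "Real Estate", "Golf"],
     "experience": [{"org": "Republican", "end": "Present"},
                    {"org": "The Apprentice", "end": "2015"}]},
    {"_id": "in-00000002", "industry": "Marketing",  # no "skills" key
     "experience": [{"org": "Marketers R Us", "end": "Present"}]},
])

# A deliberately tiny chunk size forces objects to span several reads.
rows = list(extract_rows(iter_profiles(io.StringIO(sample), chunk_size=16)))
print(rows)
```

With a real file, the same generators can be fed from open(path, encoding="utf-8"), and the collected rows passed to pd.DataFrame(rows) once at the end, so only the extracted columns are held in memory rather than the whole parsed file.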