Python: flattening nested JSON in a DataFrame

I am trying to load a JSON file into a DataFrame, and I found that it contains some nested JSON. Here is a sample of the JSON:

{'events': [{'id': 142896214,
   'playerId': 37831,
   'teamId': 3157,
   'matchId': 2214569,
   'matchPeriod': '1H',
   'eventSec': 0.8935539999999946,
   'eventId': 8,
   'eventName': 'Pass',
   'subEventId': 85,
   'subEventName': 'Simple pass',
   'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}],
   'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}

I load the JSON into a DataFrame with the following code:

import json
import pandas as pd

with open('EVENTS.json') as f:
    jsonstr = json.load(f)

df = pd.io.json.json_normalize(jsonstr['events'])

In the output of df.head(), I found two nested columns, positions and tags.

I tried to flatten them with the following code:

Position_data = json_normalize(data =jsonstr['events'], record_path='positions', meta = ['x','y','x','y'] )

It raises the following error:

KeyError: "Try running with errors='ignore' as key 'x' is not always present"

Could you tell me how to flatten positions and tags (the columns that hold nested data)?

Thanks, Zep

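As pointed out in the accepted answer, `flatten_json` is a great option, depending on how the JSON is structured and how that structure should be flattened.
- In this case, the OP wants all the values for one event on a single row, so `flatten_json` works.
- If the desired result is a separate row for each position in `positions`, then `pandas.json_normalize` is the better option.
- One issue with `flatten_json` is that if there are many `positions`, the number of columns for each event in `events` can become very large.
- If using `flatten_json`, see the linked answer for a more thorough explanation.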
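To illustrate the `pandas.json_normalize` route with `record_path` and `meta`, here is a minimal sketch on a trimmed copy of the sample event. It assumes pandas 1.0 or later (where `pd.json_normalize` is a top-level function); the `pos_` prefix and the choice of `id` and `eventName` as metadata are illustrative assumptions. The `KeyError` in the question most likely comes from listing `'x'` and `'y'` in `meta`: `meta` names are looked up on each event record, while `x` and `y` only exist inside the `positions` records.

```python
import pandas as pd

# Trimmed copy of the sample event from the question
data = {'events': [{'id': 142896214,
                    'eventName': 'Pass',
                    'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}],
                    'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}]}

# One row per entry in 'positions'; 'meta' lists keys of the parent event
# that should be repeated on every row (not keys of the position dicts).
positions = pd.json_normalize(
    data['events'],
    record_path='positions',
    meta=['id', 'eventName'],
    record_prefix='pos_',   # illustrative prefix to avoid name clashes
)

print(positions)
```

The result has one row per position, with the event-level `id` and `eventName` repeated on each row.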

Create one row for each dict in events

import pandas as pd

data = {'events': [{'id': 142896214,
                    'playerId': 37831,
                    'teamId': 3157,
                    'matchId': 2214569,
                    'matchPeriod': '1H',
                    'eventSec': 0.8935539999999946,
                    'eventId': 8,
                    'eventName': 'Pass',
                    'subEventId': 85,
                    'subEventName': 'Simple pass',
                    'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}],
                    'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}]}

Create the DataFrame:
df = pd.DataFrame.from_dict(data)
df = df['events'].apply(pd.Series)


Flatten positions with pd.Series:

df_p = df['positions'].apply(pd.Series)

df_p_0 = df_p[0].apply(pd.Series)
df_p_1 = df_p[1].apply(pd.Series)
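As an alternative sketch of the same step, the list and dict columns can be expanded with the `DataFrame` constructor on `.tolist()` instead of `.apply(pd.Series)`, which is often faster on larger frames. This assumes every event has the same number of positions and no missing values; the trimmed `data` and the name `wide` are illustrative.

```python
import pandas as pd

# Trimmed copy of the sample event (only the keys needed here)
data = {'events': [{'id': 142896214,
                    'eventName': 'Pass',
                    'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}]}]}

df = pd.DataFrame(data['events'])

# Expand the list column: one column per list index, each cell still a dict
df_p = pd.DataFrame(df['positions'].tolist(), index=df.index)

# Expand each dict column into x / y columns and add a prefix
df_p_0 = pd.DataFrame(df_p[0].tolist(), index=df.index).add_prefix('pos_0_')
df_p_1 = pd.DataFrame(df_p[1].tolist(), index=df.index).add_prefix('pos_1_')

# Attach the flattened columns and drop the original nested column
wide = pd.concat([df.drop(columns='positions'), df_p_0, df_p_1], axis=1)
print(wide)
```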

Rename positions[0] and positions[1]:
df_p_0.columns = ['pos_0_x', 'pos_0_y']
df_p_1.columns = ['pos_1_x', 'pos_1_y']

Flatten tags with pd.Series:
df_t = df.tags.apply(pd.Series)
df_t = df_t[0].apply(pd.Series)
df_t_t = df_t.tag.apply(pd.Series)

Rename id and label:
df_t = df_t.rename(columns={'id': 'tags_id'})
df_t_t.columns = ['tags_tag_label']

Combine them all with pd.concat:
df_new = pd.concat([df, df_p_0, df_p_1, df_t.tags_id, df_t_t], axis=1)

Drop the old columns:
df_new = df_new.drop(['positions', 'tags'], axis=1)


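Create a separate row for each position in positions

# normalize events
df = pd.json_normalize(data, 'events')

# explode all columns that hold lists of dicts
df = df.apply(lambda x: x.explode()).reset_index(drop=True)

# list of columns that hold dicts
cols_to_normalize = ['positions', 'tags']

# keys that become column names may overlap with existing column names,
# so add the current column name as a prefix
normalized = list()
for col in cols_to_normalize:
    d = pd.json_normalize(df[col], sep='_')
    d.columns = [f'{col}_{v}' for v in d.columns]
    normalized.append(d.copy())

# combine df with the normalized columns
df = pd.concat([df] + normalized, axis=1).drop(columns=cols_to_normalize)

# display(df)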
          id  playerId  teamId  matchId matchPeriod  eventSec  eventId eventName  subEventId subEventName  positions_x  positions_y  tags_id tags_tag_label
0  142896214     37831    3157  2214569          1H  0.893554        8      Pass          85  Simple pass           51           49     1801       accurate
1  142896214     37831    3157  2214569          1H  0.893554        8      Pass          85  Simple pass           40           53     1801       accurate
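The explode-then-normalize idea can also be sketched on a smaller frame where events have different numbers of positions. This version explodes only the one list column with `DataFrame.explode` and assumes pandas 1.1 or later for `ignore_index`; the frame `df_small` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: two events with a different number of positions each
df_small = pd.DataFrame({
    'id': [1, 2],
    'positions': [[{'x': 51, 'y': 49}, {'x': 40, 'y': 53}],
                  [{'x': 10, 'y': 20}]],
})

# One row per list element; the other columns are repeated
long = df_small.explode('positions', ignore_index=True)

# Turn the column of dicts into real columns, prefixed with the column name
xy = pd.json_normalize(long['positions'].tolist()).add_prefix('positions_')

# Replace the dict column with the flattened columns
result = pd.concat([long.drop(columns='positions'), xy], axis=1)
print(result)
```

Exploding a single named column avoids having to align several exploded columns of different lengths.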
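If you are looking for a more general way to unfold multiple levels of hierarchy from a JSON object, you can use recursion and a list comprehension to reshape the data. One alternative is shown below: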
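def flatten_json(nested_json, exclude=['']):
    """Flatten json object with nested keys into a single level.
        Args:
            nested_json: A nested json object.
            exclude: Keys to exclude from output.
        Returns:
            The flattened json object if successful, None otherwise.
    """
    out = {}

    def flatten(x, name='', exclude=exclude):
        if type(x) is dict:
            # recurse into each key, extending the name prefix
            for a in x:
                if a not in exclude: flatten(x[a], name + a + '_')
        elif type(x) is list:
            # recurse into each element, using its index in the name
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            # leaf value: store it under the prefix without the trailing '_'
            out[name[:-1]] = x

    flatten(nested_json)
    return out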
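For very deeply nested input, the same flattening can be sketched without recursion by using an explicit stack. The function name `flatten_json_iter` and its `sep` parameter are hypothetical and not part of the answer above; it is meant to produce the same `parent_index_key` style names.

```python
def flatten_json_iter(nested, sep='_', exclude=()):
    """Iteratively flatten nested dicts/lists into a single-level dict."""
    out = {}
    stack = [('', nested)]          # (name prefix, value) pairs to process
    while stack:
        prefix, value = stack.pop()
        if isinstance(value, dict):
            items = [(k, v) for k, v in value.items() if k not in exclude]
        elif isinstance(value, list):
            items = [(str(i), v) for i, v in enumerate(value)]
        else:
            out[prefix] = value     # leaf value
            continue
        # Push in reverse so children are popped in their original order
        for key, child in reversed(items):
            name = f'{prefix}{sep}{key}' if prefix else key
            stack.append((name, child))
    return out


event = {'id': 142896214,
         'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}],
         'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}

flat = flatten_json_iter(event)
print(flat)
```

Using a tuple as the default for `exclude` also sidesteps the mutable default argument in the recursive version.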

You can then apply it to your data, regardless of the nesting level:

New sample data

this_dict = {'events': [
    {'id': 142896214,
     'playerId': 37831,
     'teamId': 3157,
     'matchId': 2214569,
     'matchPeriod': '1H',
     'eventSec': 0.8935539999999946,
     'eventId': 8,
     'eventName': 'Pass',
     'subEventId': 85,
     'subEventName': 'Simple pass',
     'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}],
     'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]},
    {'id': 142896214,
     'playerId': 37831,
     'teamId': 3157,
     'matchId': 2214569,
     'matchPeriod': '1H',
     'eventSec': 0.8935539999999946,
     'eventId': 8,
     'eventName': 'Pass',
     'subEventId': 85,
     'subEventName': 'Simple pass',
     'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}, {'x': 51, 'y': 49}],
     'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}
]}

Usage

pd.DataFrame([flatten_json(x) for x in this_dict['events']])

Out[1]:
          id  playerId  teamId  matchId matchPeriod  eventSec  eventId  \
0  142896214     37831    3157  2214569          1H  0.893554        8
1  142896214     37831    3157  2214569          1H  0.893554        8

  eventName  subEventId subEventName  positions_0_x  positions_0_y  \
0      Pass          85  Simple pass             51             49
1      Pass          85  Simple pass             51