Python pandas: convert nested JSON to a flat table
I have a JSON with the following structure:
{
    "a": "a_1",
    "b": "b_1",
    "c": [{
        "d": "d_1",
        "e": "e_1",
        "f": [],
        "g": "g_1",
        "h": "h_1"
    }, {
        "d": "d_2",
        "e": "e_2",
        "f": [],
        "g": "g_2",
        "h": "h_2"
    }, {
        "d": "d_3",
        "e": "e_3",
        "f": [{
            "i": "i_1",
            "j": "j_1",
            "k": "k_1",
            "l": "l_1",
            "m": []
        }, {
            "i": "i_2",
            "j": "j_2",
            "k": "k_2",
            "l": "l_2",
            "m": [{
                "n": "n_1",
                "o": "o_1",
                "p": "p_1",
                "q": "q_1"
            }]
        }],
        "g": "g_3",
        "h": "h_3"
    }]
}
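I want to convert it into a pandas DataFrame of the following kind:
How can I achieve this?
Below is my attempt, but it goes in a completely different direction.
Code:
Output: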
0
a a_1
b b_1
c_0_d d_1
c_0_e e_1
c_0_g g_1
c_0_h h_1
c_1_d d_2
c_1_e e_2
c_1_g g_2
c_1_h h_2
c_2_d d_3
c_2_e e_3
c_2_f_0_i i_1
c_2_f_0_j j_1
c_2_f_0_k k_1
c_2_f_0_l l_1
c_2_f_1_i i_2
c_2_f_1_j j_2
c_2_f_1_k k_2
c_2_f_1_l l_2
c_2_f_1_m_0_n n_1
c_2_f_1_m_0_o o_1
c_2_f_1_m_0_p p_1
c_2_f_1_m_0_q q_1
c_2_g g_3
c_2_h h_3
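Output of this shape, with list positions folded into the key names and empty lists dropped, is what a generic key-path flattener tends to produce. The sketch below is not the asker's code (which is not shown here); `flatten` and `sample` are hypothetical names used only to illustrate why this approach yields one long column instead of a table.

```python
import pandas as pd

def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {path: value}, joining path parts with '_'."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}          # leaf value: emit it under its full path
    out = {}
    for key, value in items:
        path = f"{prefix}_{key}" if prefix else str(key)
        out.update(flatten(value, path))
    return out                        # an empty list contributes no keys at all

# Hypothetical, shortened version of the JSON in the question
sample = {"a": "a_1", "c": [{"d": "d_1", "f": []}, {"d": "d_2", "f": [{"i": "i_1"}]}]}
flat = flatten(sample)
df_flat = pd.Series(flat).to_frame()  # one row per path, single column named 0
```

Every leaf becomes its own row label, so repeated records such as the items of "c" never line up under shared columns.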
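Before reading
- This does the job asked for in the question; if there are additional particularities, please say so
- It can certainly be improved; treat it as one possible solution to your problem
- Note that the key to solving the problem is that a recursive function can be used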
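To make the recursion point concrete, here is a minimal sketch of a recursive walk that reports the nesting level of every string value. The names `walk` and `sample` are assumptions for illustration only and are not part of the solution that follows.

```python
def walk(node, level=0):
    """Yield (level, key, value) for each string value, recursing into lists of dicts."""
    for key, value in node.items():
        if isinstance(value, str):
            yield level, key, value
        elif isinstance(value, list):
            for child in value:          # each list item is assumed to be a dict
                yield from walk(child, level + 1)

# Hypothetical, shortened version of the JSON in the question
sample = {"a": "a_1", "c": [{"d": "d_1", "f": [{"i": "i_1"}]}]}
leaves = list(walk(sample))
```

The level carried through the recursion is the same kind of information the solution uses to decide how far a column has to be shifted down.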
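Solution
With _dict being your nested dictionary, you can use a recursive function and a few tricks to reach your goal:
I first write a function, iterate_dict, that reads the dictionary recursively and stores the result in a new dict whose keys/values are the column contents of the final pd.DataFrame: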
def iterate_dict(_dict, _fdict, level=0):
    for k in _dict.keys():  # Iterate over keys of a dict
        # If value is a string update _fdict
        if isinstance(_dict[k], str):
            # If first seen, initialize your dict
            if k not in _fdict:
                _fdict[k] = [-1] * (level - 1)  # Trick to shift columns
            # Append the value
            _fdict[k].append(_dict[k])
        # If a list
        if isinstance(_dict[k], list):
            if k not in _fdict:  # If first seen key initialize
                _fdict[k] = [-1] * level  # Same previous trick
                # Extend with required range (0, 1, 2 ...)
                _fdict[k].extend(range(len(_dict[k])))
            else:
                if len(_dict[k]) > 0:
                    # Continue numbering after the last stored index (-1 means none yet)
                    _start = 0 if len(_fdict[k]) == 0 else (int(_fdict[k][-1]) + 1)
                    _fdict[k].extend(range(_start, _start + len(_dict[k])))  # Extend
            for _d in _dict[k]:  # If value of key is a list recall iterate_dict
                iterate_dict(_d, _fdict, level=level + 1)
Another function, to_series, converts the values of each future column into a pd.Series, replacing the placeholder int -1 with np.nan:
import numpy as np
import pandas as pd

def to_series(_fvalues):
    if _fvalues[0] == -1:
        _fvalues.insert(0, -1)  # Trick to shift again
    return pd.Series(_fvalues).replace(-1, np.nan)  # Replace -1 with nan in case
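As a small illustration of the shift, the sketch below runs to_series on two hand-written lists shaped like the ones iterate_dict builds for a top-level column and a nested column. The variable names `top` and `nested` are assumptions for this example.

```python
import numpy as np
import pandas as pd

def to_series(_fvalues):
    # Same logic as the function above; note that it mutates its argument.
    if _fvalues[0] == -1:
        _fvalues.insert(0, -1)  # one extra placeholder pushes the column down a row
    return pd.Series(_fvalues).replace(-1, np.nan)

top = to_series(["d_1", "d_2", "d_3"])   # no placeholder: left unchanged
nested = to_series([-1, "i_1", "i_2"])   # leading placeholder: shifted by one more row
```

Because the list is modified in place, calling to_series twice on the same list would shift it twice; passing a copy (for example list(v)) avoids that when re-running a cell.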
Then use it like this:
_fdict = dict() #The future columns content
iterate_dict(_dict,_fdict) #Do the Job
print(_fdict)
{'a': ['a_1'],
'b': ['b_1'],
'c': [0, 1, 2],
'd': ['d_1', 'd_2', 'd_3'],
'e': ['e_1', 'e_2', 'e_3'],
'f': [-1, 0, 1],
'g': ['g_1', 'g_2', 'g_3'],
'h': ['h_1', 'h_2', 'h_3'],
'i': [-1, 'i_1', 'i_2'],
'j': [-1, 'j_1', 'j_2'],
'k': [-1, 'k_1', 'k_2'],
'l': [-1, 'l_1', 'l_2'],
'm': [-1, -1, 0],
'n': [-1, -1, 'n_1'],
'o': [-1, -1, 'o_1'],
'p': [-1, -1, 'p_1'],
'q': [-1, -1, 'q_1']}
#Here you can see a shift is required, use your custom to_series() function
Then create your pd.DataFrame:
df = pd.DataFrame(dict([ (k,to_series(v)) for k,v in _fdict.items() ])).ffill()
#Don't forget to do a forward fillna as needed
print(df)
a b c d e f g h i j k l m n o \
0 a_1 b_1 0.0 d_1 e_1 NaN g_1 h_1 NaN NaN NaN NaN NaN NaN NaN
1 a_1 b_1 1.0 d_2 e_2 NaN g_2 h_2 NaN NaN NaN NaN NaN NaN NaN
2 a_1 b_1 2.0 d_3 e_3 0.0 g_3 h_3 i_1 j_1 k_1 l_1 NaN NaN NaN
3 a_1 b_1 2.0 d_3 e_3 1.0 g_3 h_3 i_2 j_2 k_2 l_2 0.0 n_1 o_1
p q
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 p_1 q_1
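The forward fill matters because the columns have different lengths: pandas aligns the Series on their index and pads the shorter ones with NaN. A minimal sketch with three hand-written columns (the names `columns`, `raw` and `filled` are assumptions for this example):

```python
import numpy as np
import pandas as pd

# Columns of unequal length, shaped like the ones to_series() returns
columns = {
    "a": pd.Series(["a_1"]),
    "d": pd.Series(["d_1", "d_2", "d_3"]),
    "i": pd.Series([np.nan, np.nan, "i_1", "i_2"]),
}

raw = pd.DataFrame(columns)   # shorter columns are padded with NaN at the bottom
filled = raw.ffill()          # parent values are repeated down to the child rows
```

Leading NaN values stay in place because there is nothing above them to fill from, which is why rows 0 and 1 keep NaN in the nested columns.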
You must have tried something. Let's see it!
Thankfully, Python dictionaries are very similar to JSON. This helped me: