Python: reshaping a DataFrame for export to a nested dict
Given the following DataFrame:
Category Area Country Code Function Last Name LanID Spend1 Spend2 Spend3 Spend4 Spend5
0 Bisc EE RU02,UA02 Mk Smith df3432 1.0 NaN NaN NaN NaN
1 Bisc EE RU02 Mk Bibs fdss34 1.0 NaN NaN NaN NaN
2 Bisc EE UA02,EURASIA Mk Crow fdsdr43 1.0 NaN NaN NaN NaN
3 Bisc WE FR31 Mk Ellis fdssdf3 1.0 NaN NaN NaN NaN
4 Bisc WE BE32,NL31 Mk Mower TOZ1720 1.0 NaN NaN NaN NaN
5 Bisc WE FR31,BE32,NL31 LKU Elan SKY8851 1.0 1.0 1.0 1.0 1.0
6 Bisc SE IT31 Mk Bobret 3dfsfg 1.0 NaN NaN NaN NaN
7 Bisc SE GR31 Mk Concept MOSGX009 1.0 NaN NaN NaN NaN
8 Bisc SE RU02,IT31,GR31,PT31,ES31 LKU Solar MSS5723 1.0 1.0 1.0 1.0 1.0
9 Bisc SE IT31,GR31,PT31,ES31 Mk Brix fdgd22 NaN 1.0 NaN NaN NaN
10 Choc CE RU02,CZ31,SK31,PL31,LT31 Fin Ocoser 43233d NaN 1.0 NaN NaN NaN
11 Choc CE DE31,AT31,HU31,CH31 Fin Smuth 4rewf NaN 1.0 NaN NaN NaN
12 Choc CE BG31,RO31,EMA Fin Momocs hgghg2 NaN 1.0 NaN NaN NaN
13 Choc WE FR31,BE32,NL31 Fin Bruntly ffdd32 NaN NaN NaN NaN 1.0
14 Choc WE FR31,BE32,NL31 Mk Ofer BROGX011 NaN 1.0 1.0 NaN NaN
15 Choc WE FR31,BE32,NL31 Mk Hem NZJ3189 NaN NaN NaN 1.0 1.0
16 G&C NE UA02,SE31 Mk Cre ORY9499 1.0 NaN NaN NaN NaN
17 G&C NE NO31 Mk Qlyo XVM7639 1.0 NaN NaN NaN NaN
18 G&C NE GB31,NO31,SE31,IE31,FI31 Mk Omny LOX1512 NaN 1.0 1.0 NaN NaN
I would like to export it to a nested dict with the following structure:
{RU02: {Bisc: {EE: {Mkt: {Spend1: {df3432: Smith}
{fdss34: Bibs}
{Bisc: {SE: {LKU: {Spend1: {MSS5723: Solar}
{Spend2: {MSS5723: Solar}
{Spend3: {MSS5723: Solar}
{Spend4: {MSS5723: Solar}
{Spend5: {MSS5723: Solar}
{Choc: {CE: {Fin: {Spend2: {43233d: Ocoser}
.....
{UA02: {Bisc: {EE: {Mkt: {Spend1: {df3432: Smith}
{ffdsdr43: Crow}
{G&C: {NE: {Mkt: {Spend1: {ORY9499: Cre}
.....
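So, in essence, I am trying to track, for each Country Code, the list of Last Name + LanID per spend category (Spend1, Spend2, etc.), together with their attributes (Function, Category, Area).
The DataFrame is not very large (fewer than 200 rows), but it contains nearly every kind of combination between Category/Area/Country Code and the Last Names and their spend categories (many-to-many).
My challenge is that I cannot clearly conceptualize the steps I need to take to prepare the DataFrame properly for export to a dict.
I have a rough idea of what I need, and this is how far I got on my own: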
#keeping track of initial order of columns
initialOrder = list(df.columns.values)
# split the Country Code by ","
CCodeNoCommas= [item for items in df['Country Code'].values for item in items.split(",")]
# add only the UNIQUE Country Codes -via set- as new columns in the DataFrame,
#with NaN for row values
df = pd.concat([df,pd.DataFrame(columns=list(set(CCodeNoCommas)))])
# reordering columns to have the newly added ones at the end
reordered = initialOrder + [c for c in df.columns if c not in initialOrder]
df = df[reordered]
# replace NaN with 1 in the newly added columns (Country Codes), where the same Country code
# exists in the initial column "Country Code"; do this for each row
CCodeUniqueOnly = set(CCodeNoCommas)
for c in CCodeUniqueOnly:
    CCodeIsPresent_rowIndex = df.index[df['Country Code'].str.contains(c)]
    #print (CCodeIsPresent_rowIndex)
    df.loc[CCodeIsPresent_rowIndex, c] = 1
# no clue what to do next ??
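If you reshape the DataFrame into the right format, you can use a handy recursive dictionary function by @DSM. The goal is to get a DataFrame in which each row contains exactly one "entry": a unique combination of the columns you are interested in.
First, you need to split the Country Code strings into lists: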
df['Country Code'] = df['Country Code'].str.split(',')
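Then expand these lists into multiple rows (using @RomanPekar's technique):
s = df.apply(lambda x: pd.Series(x['Country Code']), axis=1) \
      .stack().reset_index(level=1, drop=True)
s.name = 'Country Code'
df = df.drop('Country Code', axis=1).join(s).reset_index(drop=True)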
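On newer pandas versions (0.25 and later), the same expansion can probably be written more compactly with `DataFrame.explode`. The snippet below is only a sketch on a toy frame: the name `toy` and its two rows are made up for illustration and are not the question's full data.

```python
import pandas as pd

# Toy frame standing in for the question's data (illustrative values only).
toy = pd.DataFrame({
    "Country Code": ["RU02,UA02", "FR31"],
    "LanID": ["df3432", "fdssdf3"],
})

# Split the comma-separated string into a list per row...
toy["Country Code"] = toy["Country Code"].str.split(",")
# ...then give each list element its own row, repeating the other columns.
exploded = toy.explode("Country Code").reset_index(drop=True)
print(exploded)
```

This swaps in `explode` for the `apply`/`stack`/`join` technique used in the answer; the resulting rows should be the same, one per (row, country code) pair.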
Then you can reshape the Spend* columns into rows, with one row for each Spend* column whose value is not NaN:
spend_cols = ['Spend1', 'Spend2', 'Spend3', 'Spend4', 'Spend5']
df = df.groupby('Country Code') \
       .apply(lambda g: g.join(pd.DataFrame(g[spend_cols].stack()) \
                                 .reset_index(level=1)['level_1'])) \
       .reset_index(drop=True)
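If the `groupby`/`stack`/`join` chain is hard to follow, a similar wide-to-long reshape can be sketched with `DataFrame.melt` followed by `dropna`. This is an alternative to the answer's code, not the same code: the toy frame, the helper column `flag`, and the name `long_df` are assumptions made for the example. The melted column is named `level_1` only so that it lines up with the column name used in the rest of the answer.

```python
import numpy as np
import pandas as pd

# Toy frame: two people under one country code (illustrative values only).
toy = pd.DataFrame({
    "Country Code": ["RU02", "RU02"],
    "LanID": ["df3432", "MSS5723"],
    "Spend1": [1.0, 1.0],
    "Spend2": [np.nan, 1.0],
})
spend_cols = ["Spend1", "Spend2"]

long_df = (
    toy.melt(id_vars=["Country Code", "LanID"], value_vars=spend_cols,
             var_name="level_1", value_name="flag")  # one row per (person, Spend*)
       .dropna(subset=["flag"])                      # keep only the non-NaN spends
       .drop(columns="flag")                         # the flag itself is not needed
       .reset_index(drop=True)
)
print(long_df)
```

Unlike the answer's version, `melt` orders rows by Spend column rather than by original row, which should not matter once the frame is grouped into a dict.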
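Now you have a DataFrame in which every level of the nested dictionary is its own column, so you can use this recursive dictionary function (`.ix` was removed in pandas 1.0, so `.iloc` is used here):
def recur_dictify(frame):
    # Base case: one column left, so return its value (or values)
    if len(frame.columns) == 1:
        if frame.values.size == 1:
            return frame.values[0][0]
        return frame.values.squeeze()
    # Group by the first column and recurse on the remaining columns
    grouped = frame.groupby(frame.columns[0])
    d = {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}
    return d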
Apply it only to the columns you want in the nested dictionary, listed in nesting order:
cols = ['Country Code', 'Category', 'Area', 'Function', 'level_1', 'LanID', 'Last Name']
d = recur_dictify(df[cols])
This should produce the result you want.
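To see what the recursion does, here is a small worked sketch on a toy frame with only four of the nesting columns. The frame `toy` is invented for illustration, and the function uses `.iloc` in place of `.ix`, which no longer exists in pandas 1.0 and later.

```python
import pandas as pd

def recur_dictify(frame):
    # Base case: a single column is left, so return its value(s).
    if len(frame.columns) == 1:
        if frame.values.size == 1:
            return frame.values[0][0]
        return frame.values.squeeze()
    # Otherwise group by the first column and recurse on the rest.
    grouped = frame.groupby(frame.columns[0])
    return {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}

# Toy long-format frame: one row per (country code, spend, person).
toy = pd.DataFrame({
    "Country Code": ["RU02", "RU02", "UA02"],
    "level_1": ["Spend1", "Spend2", "Spend1"],
    "LanID": ["df3432", "MSS5723", "df3432"],
    "Last Name": ["Smith", "Solar", "Smith"],
})

d = recur_dictify(toy)
print(d)
```

Each column becomes one level of keys, and the last column supplies the leaf values. Note that if several rows share the same full key path, the leaf is a NumPy array rather than a single string.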
All in one:
import pandas as pd

# df is the DataFrame from the question
# 1. Split the Country Code strings into lists
df['Country Code'] = df['Country Code'].str.split(',')
# 2. Expand the lists so that each row holds a single Country Code
s = df.apply(lambda x: pd.Series(x['Country Code']), axis=1) \
      .stack().reset_index(level=1, drop=True)
s.name = 'Country Code'
df = df.drop('Country Code', axis=1).join(s).reset_index(drop=True)
# 3. Turn the Spend* columns into rows (one row per non-NaN Spend* value)
spend_cols = ['Spend1', 'Spend2', 'Spend3', 'Spend4', 'Spend5']
df = df.groupby('Country Code') \
       .apply(lambda g: g.join(pd.DataFrame(g[spend_cols].stack()) \
                                 .reset_index(level=1)['level_1'])) \
       .reset_index(drop=True)
# 4. Recursively nest the columns into a dict
def recur_dictify(frame):
    if len(frame.columns) == 1:
        if frame.values.size == 1:
            return frame.values[0][0]
        return frame.values.squeeze()
    grouped = frame.groupby(frame.columns[0])
    # .iloc replaces .ix, which was removed in pandas 1.0
    d = {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}
    return d
cols = ['Country Code', 'Category', 'Area', 'Function', 'level_1', 'LanID', 'Last Name']
d = recur_dictify(df[cols])
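As a cross-check on the result, the same nesting can also be sketched without any pandas reshaping, by walking the rows and using `dict.setdefault`. This is not the approach taken in the answer: the function name `build_nested` and the two sample rows are hypothetical, and only two Spend columns are included to keep the example short.

```python
import math

# Two sample rows shaped like the question's data (illustrative subset).
rows = [
    {"Country Code": "RU02,UA02", "Category": "Bisc", "Area": "EE",
     "Function": "Mk", "Last Name": "Smith", "LanID": "df3432",
     "Spend1": 1.0, "Spend2": math.nan},
    {"Country Code": "RU02", "Category": "Bisc", "Area": "EE",
     "Function": "Mk", "Last Name": "Bibs", "LanID": "fdss34",
     "Spend1": 1.0, "Spend2": math.nan},
]
spend_cols = ["Spend1", "Spend2"]

def build_nested(rows, spend_cols):
    """Hypothetical helper: nest rows as code > category > area > function > spend."""
    nested = {}
    for row in rows:
        # Spend columns that are actually set (not NaN) for this person.
        active = [c for c in spend_cols if not math.isnan(row[c])]
        for code in row["Country Code"].split(","):
            for spend in active:
                node = nested
                # Walk down the key path, creating levels as needed.
                for key in (code, row["Category"], row["Area"],
                            row["Function"], spend):
                    node = node.setdefault(key, {})
                node[row["LanID"]] = row["Last Name"]
    return nested

d = build_nested(rows, spend_cols)
print(d)
```

With a real DataFrame, the rows could presumably be obtained with `df.to_dict("records")` on the original, unreshaped frame. At the scale mentioned in the question (under 200 rows), a plain loop like this should be fast enough.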
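This is certainly not a trivial problem to solve. Please show your effort.
Sorry, I forgot to do that when posting. Please find it in the edited version now, @cᴏʟᴅsᴘᴇᴇᴅ.
Outstanding! It works perfectly, and I really appreciate the step-by-step explanation. Thank you very much.
I'm glad I could help.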