Python: reshaping a DataFrame for export to a nested dict

Given the following DataFrame:

   Category Area               Country Code Function Last Name     LanID  Spend1  Spend2  Spend3  Spend4  Spend5
0      Bisc   EE                  RU02,UA02       Mk     Smith    df3432     1.0     NaN     NaN     NaN     NaN
1      Bisc   EE                       RU02       Mk      Bibs    fdss34     1.0     NaN     NaN     NaN     NaN
2      Bisc   EE               UA02,EURASIA       Mk      Crow   fdsdr43     1.0     NaN     NaN     NaN     NaN
3      Bisc   WE                       FR31       Mk     Ellis   fdssdf3     1.0     NaN     NaN     NaN     NaN
4      Bisc   WE                  BE32,NL31       Mk     Mower   TOZ1720     1.0     NaN     NaN     NaN     NaN
5      Bisc   WE             FR31,BE32,NL31      LKU      Elan   SKY8851     1.0     1.0     1.0     1.0     1.0
6      Bisc   SE                       IT31       Mk    Bobret    3dfsfg     1.0     NaN     NaN     NaN     NaN
7      Bisc   SE                       GR31       Mk   Concept  MOSGX009     1.0     NaN     NaN     NaN     NaN
8      Bisc   SE   RU02,IT31,GR31,PT31,ES31      LKU     Solar   MSS5723     1.0     1.0     1.0     1.0     1.0
9      Bisc   SE        IT31,GR31,PT31,ES31       Mk      Brix    fdgd22     NaN     1.0     NaN     NaN     NaN
10     Choc   CE   RU02,CZ31,SK31,PL31,LT31      Fin    Ocoser    43233d     NaN     1.0     NaN     NaN     NaN
11     Choc   CE        DE31,AT31,HU31,CH31      Fin     Smuth     4rewf     NaN     1.0     NaN     NaN     NaN
12     Choc   CE              BG31,RO31,EMA      Fin    Momocs    hgghg2     NaN     1.0     NaN     NaN     NaN
13     Choc   WE             FR31,BE32,NL31      Fin   Bruntly    ffdd32     NaN     NaN     NaN     NaN     1.0
14     Choc   WE             FR31,BE32,NL31       Mk      Ofer  BROGX011     NaN     1.0     1.0     NaN     NaN
15     Choc   WE             FR31,BE32,NL31       Mk       Hem   NZJ3189     NaN     NaN     NaN     1.0     1.0
16      G&C   NE                  UA02,SE31       Mk       Cre   ORY9499     1.0     NaN     NaN     NaN     NaN
17      G&C   NE                       NO31       Mk      Qlyo   XVM7639     1.0     NaN     NaN     NaN     NaN
18      G&C   NE   GB31,NO31,SE31,IE31,FI31       Mk      Omny   LOX1512     NaN     1.0     1.0     NaN     NaN
I would like to export it into a nested dict with the following structure:

    {RU02:  {Bisc:  {EE:    {Mkt:   {Spend1:    {df3432:    Smith}
                                                {fdss34:     Bibs}
            {Bisc:  {SE:    {LKU:   {Spend1:    {MSS5723:   Solar}
                                    {Spend2:    {MSS5723:   Solar}
                                    {Spend3:    {MSS5723:   Solar}
                                    {Spend4:    {MSS5723:   Solar}
                                    {Spend5:    {MSS5723:   Solar}
            {Choc:  {CE:    {Fin:   {Spend2:    {43233d:   Ocoser}
            .....

    {UA02:  {Bisc:  {EE:    {Mkt:   {Spend1:    {df3432:    Smith}
                                                {fdsdr43:    Crow}
            {G&C:   {NE:    {Mkt:   {Spend1:    {ORY9499:     Cre}
    .....
So essentially, what I am trying to do here is track, for each Country Code and for each Spend category (Spend1, Spend2, etc.), the list of Last Name + LanID pairs along with their attributes (Function, Category, Area).

The DataFrame is not very large (fewer than 200 rows), but it contains pretty much every kind of combination between Category/Area/Country Code and the Last Names with their Spend categories (many-to-many).

My challenge is that I cannot figure out how to clearly conceptualize the steps needed to prepare the DataFrame properly for export to a dict.

So far, I think I need:

  • A way to slice the contents of the "Country Code" column on the "," separator: DONE
  • Create new columns from the unique Country Codes, and set a 1 in every row where that code appears in the original "Country Code" column: DONE
  • Recursively set the DataFrame's index to each of the newly added columns
  • Move the rows that have data for a given Country Code into a new DataFrame
  • Export all the new DataFrames to dicts and then merge them
  • However, I am not sure steps 3-6 are the best way to go about this, since I still struggle to understand how to configure pd.DataFrame.to_dict for my case (if that is even possible); a quick sketch of what to_dict gives me is just below this list.
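
For reference, here is a minimal sketch (on a hypothetical two-row frame) of why pd.DataFrame.to_dict alone does not seem to get me there: its orient options only produce one or two levels of nesting, not the Country Code / Category / Area hierarchy above.

    import pandas as pd

    # hypothetical two-row frame, just to illustrate the orient options
    toy = pd.DataFrame({'LanID': ['df3432', 'fdss34'],
                        'Last Name': ['Smith', 'Bibs']})

    # 'records' -> a flat list of dicts; 'index' -> one level keyed by row label;
    # neither nests by Country Code / Category / Area on its own
    print(toy.to_dict(orient='records'))
    # [{'LanID': 'df3432', 'Last Name': 'Smith'}, {'LanID': 'fdss34', 'Last Name': 'Bibs'}]
    print(toy.to_dict(orient='index'))
    # {0: {'LanID': 'df3432', 'Last Name': 'Smith'}, 1: {'LanID': 'fdss34', 'Last Name': 'Bibs'}}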

    Many thanks for any help with the code, and also for briefly explaining your thought process at each stage.


    Here is how far I have gotten on my own:

    #keeping track of initial order of columns
    initialOrder = list(df.columns.values)
    
    # split the Country Code by ","
    CCodeNoCommas= [item for items in df['Country Code'].values for item in items.split(",")]
    
    # add only the UNIQUE Country Codes -via set- as new columns in the DataFrame,
    #with NaN for row values
    df = pd.concat([df,pd.DataFrame(columns=list(set(CCodeNoCommas)))])
    
    # reordering columns to have the newly added ones at the end
    reordered = initialOrder + [c for c in df.columns if c not in initialOrder]
    df = df[reordered]
    
    
    # replace NaN with 1 in the newly added columns (Country Codes), where the same Country code
    # exists in the initial column "Country Code"; do this for each row
    
    CCodeUniqueOnly = set(CCodeNoCommas)
    for c in CCodeUniqueOnly:   
        CCodeIsPresent_rowIndex = df.index[df['Country Code'].str.contains(c)]
    
        #print (CCodeIsPresent_rowIndex)
        df.loc[CCodeIsPresent_rowIndex, c] = 1
    
    # no clue what to do next ??
    

    If you reformat your dataframe into the right shape, you can use @DSM's handy recursive dictionary function. The goal is to get a dataframe where each row holds exactly one "entry": a unique combination of the columns you are interested in.

    First, you need to split the Country Code strings into lists:

    df['Country Code'] = df['Country Code'].str.split(',')
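
    A quick check on the sample data (output illustrative): each cell now holds a Python list instead of the comma-separated string:

    df['Country Code'].head(3)
    # 0       [RU02, UA02]
    # 1             [RU02]
    # 2    [UA02, EURASIA]
    # Name: Country Code, dtype: object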
    
    Then expand those lists into multiple rows, using @RomanPekar's technique:
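
    # expand: one row per individual country code (same snippet as in the all-in-one version below)
    s = df.apply(lambda x: pd.Series(x['Country Code']), axis=1) \
        .stack().reset_index(level=1, drop=True)
    s.name = 'Country Code'
    df = df.drop('Country Code', axis=1).join(s).reset_index(drop=True)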

    Then you can reshape the Spend* columns into rows, so that there is one row for each Spend* column whose value is not nan:

    spend_cols = ['Spend1', 'Spend2', 'Spend3', 'Spend4', 'Spend5']
    df = df.groupby('Country Code') \
        .apply(lambda g: g.join(pd.DataFrame(g[spend_cols].stack()) \
        .reset_index(level=1)['level_1'])) \
        .reset_index(drop=True)
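
    The stack keeps only the non-nan Spend* values, and reset_index(level=1) turns the stacked column names into a regular column, which pandas names level_1 by default. Filtering the sample data for a single country code should show something like this (illustrative, row index omitted):

    df.loc[df['Country Code'] == 'IT31', ['Last Name', 'LanID', 'level_1']]
    #   Last Name    LanID level_1
    #      Bobret   3dfsfg  Spend1
    #       Solar  MSS5723  Spend1
    #       Solar  MSS5723  Spend2
    #       Solar  MSS5723  Spend3
    #       Solar  MSS5723  Spend4
    #       Solar  MSS5723  Spend5
    #        Brix   fdgd22  Spend2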
    
    Now you have a dataframe where each level of the nested dictionary is its own column, so you can use this recursive dictionary function:

    def recur_dictify(frame):
        # base case: only one column left -> return its value(s) directly
        if len(frame.columns) == 1:
            if frame.values.size == 1: return frame.values[0][0]
            return frame.values.squeeze()
        # otherwise group on the first column and recurse on the remaining columns
        grouped = frame.groupby(frame.columns[0])
        d = {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}
        return d
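
    To see what the function does, here is a tiny made-up frame (column names and values are arbitrary, just for illustration):

    demo = pd.DataFrame({'A': ['x', 'x', 'y'],
                         'B': ['b1', 'b2', 'b1'],
                         'C': ['c1', 'c2', 'c3']})
    print(recur_dictify(demo))
    # {'x': {'b1': 'c1', 'b2': 'c2'}, 'y': {'b1': 'c3'}}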
    
    Apply it only to the columns you want in the nested dictionary, listed in nesting order:

    cols = ['Country Code', 'Category', 'Area', 'Function', 'level_1', 'LanID', 'Last Name']
    d = recur_dictify(df[cols])
    
    That should produce the result you are after.
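
    For example, drilling into the result (the keys come from the data itself, so the Function level is 'Mk' rather than the 'Mkt' shown in the sketch above):

    d['RU02']['Bisc']['EE']['Mk']['Spend1']
    # {'df3432': 'Smith', 'fdss34': 'Bibs'}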


    All in one:

    # split the comma-separated Country Code strings into lists
    df['Country Code'] = df['Country Code'].str.split(',')

    # expand each list so there is one row per individual country code
    s = df.apply(lambda x: pd.Series(x['Country Code']),axis=1) \
        .stack().reset_index(level=1, drop=True)
    s.name = 'Country Code'
    df = df.drop('Country Code', axis=1).join(s).reset_index(drop=True)

    # reshape the Spend* columns: one row per non-nan Spend* value,
    # keeping the Spend* name in a new 'level_1' column
    spend_cols = ['Spend1', 'Spend2', 'Spend3', 'Spend4', 'Spend5']
    df = df.groupby('Country Code') \
        .apply(lambda g: g.join(pd.DataFrame(g[spend_cols].stack()) \
        .reset_index(level=1)['level_1'])) \
        .reset_index(drop=True)
    
    def recur_dictify(frame):
        if len(frame.columns) == 1:
            if frame.values.size == 1: return frame.values[0][0]
            return frame.values.squeeze()
        grouped = frame.groupby(frame.columns[0])
        d = {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}
        return d

    # build the nested dict, columns listed in nesting order
    cols = ['Country Code', 'Category', 'Area', 'Function', 'level_1', 'LanID', 'Last Name']
    d = recur_dictify(df[cols])
    
