Python: pivoting a table while preserving row order

I have the following DataFrame:

df = pd.DataFrame({'col1': ["ip", "state", "ip", "state", "jobs", "ip", "state", "status"],
                   'col2': ["10-0-11-99", "running","10-0-11-19", "running", "0/344","10-0-11-23", "running", "on"]})

    col1      col2
0   ip  10-0-11-99
1   state   running
2   ip  10-0-11-19
3   state   running
4   jobs    0/344
5   ip  10-0-11-23
6   state   running
7   status  on
I want to convert it to the following format:

          ip    state   jobs    status
0   10-0-11-99  running NaN     NaN
1   10-0-11-19  running 0/344   NaN
2   10-0-11-23  running NaN     on
I used the code below to convert it, but cumcount gives the wrong index, and I have run out of ideas for fixing it:

cols = df.groupby(['col1'],sort=False).aggregate(np.sum).reset_index()['col1'].to_list()
df['index'] = df.groupby(df['col1'])['col1'].cumcount()
df = df.pivot(index='index', columns='col1', values='col2')
df = df.reindex(cols, axis=1)

col1    ip     state    jobs      status
index               
0   10-0-11-99  running 0/344      on
1   10-0-11-19  running NaN       NaN
2   10-0-11-23  running NaN       NaN

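A key assumption is that ip determines how the data is transposed and is always present, so every record starts with an ip row. We can then flag the rows where col1 equals ip, take the cumulative sum of that flag to get a record number, build a MultiIndex, and unstack: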

res = df.assign(temp=df.col1.eq("ip").cumsum())
res

    col1    col2      temp
0   ip      10-0-11-99  1
1   state   running     1
2   ip      10-0-11-19  2
3   state   running     2
4   jobs    0/344       2
5   ip      10-0-11-23  3
6   state   running     3
7   status  on          3
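To see why the question's cumcount approach misplaces values while temp does not, the two numberings can be compared side by side. This is a minimal sketch; the column names per_key and per_record are illustrative and do not appear in the original post.

```python
import pandas as pd

df = pd.DataFrame({'col1': ["ip", "state", "ip", "state", "jobs", "ip", "state", "status"],
                   'col2': ["10-0-11-99", "running", "10-0-11-19", "running", "0/344",
                            "10-0-11-23", "running", "on"]})

# cumcount numbers the occurrences of each key independently,
# so the first "jobs" and the first "status" both get 0.
per_key = df.groupby("col1").cumcount()

# The cumulative sum of the "ip" flag numbers the records instead:
# every row up to the next "ip" shares the same value.
per_record = df["col1"].eq("ip").cumsum()

print(df.assign(per_key=per_key, per_record=per_record))
```

With per_key, jobs and status both land in row 0; with per_record they stay attached to the ip that precedes them.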
Build the MultiIndex:

index = pd.MultiIndex.from_product([res.col1.unique(), 
                                    res.temp.unique()], 
                                    names=["col1", "temp"])

MultiIndex([(    'ip', 1),
            (    'ip', 2),
            (    'ip', 3),
            ( 'state', 1),
            ( 'state', 2),
            ( 'state', 3),
            (  'jobs', 1),
            (  'jobs', 2),
            (  'jobs', 3),
            ('status', 1),
            ('status', 2),
            ('status', 3)],
           names=['col1', 'temp'])
The final step is to set the index, reindex with the MultiIndex stored in index, and unstack:

res.set_index(["col1", "temp"]).reindex(index).unstack("col1")

                               col2
col1    ip      jobs    state   status
temp                
1   10-0-11-99  NaN     running NaN
2   10-0-11-19  0/344   running NaN
3   10-0-11-23  NaN     running on
You can reset the index afterwards, but that is trivial, so I did not bother here.
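For reference, the same idea can be sketched more compactly with DataFrame.pivot, which already fills missing key/record combinations with NaN. This is a substitution for the MultiIndex reindex shown above, not the answer's exact method, and it relies on the same assumption that every record starts with an ip row. The names record and wide are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': ["ip", "state", "ip", "state", "jobs", "ip", "state", "status"],
                   'col2': ["10-0-11-99", "running", "10-0-11-19", "running", "0/344",
                            "10-0-11-23", "running", "on"]})

# Column order by first appearance in the input: ip, state, jobs, status.
cols = df["col1"].unique()

wide = (
    df.assign(record=df["col1"].eq("ip").cumsum())   # record number per row
      .pivot(index="record", columns="col1", values="col2")
      .reindex(columns=cols)                          # pivot sorts columns; restore input order
      .reset_index(drop=True)                         # 0-based row index, as in the desired output
)
wide.columns.name = None  # drop the leftover "col1" label on the columns

print(wide)
```

pivot raises an error if a key repeats within one record, so duplicated keys would need to be handled before this step.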

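Alternatively, a library function (its name and the library's name are missing from the source) can abstract the index-building step and expose the missing rows; at the time of writing, this required installing the latest development version.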

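I think the simplest and most direct approach is to loop over your df with some simple logic, so that we do not need to worry about the index at all: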

import pandas as pd

df = pd.DataFrame({'col1': ["ip", "state", "ip", "state", "jobs", "ip", "state", "status"],
                   'col2': ["10-0-11-99", "running","10-0-11-19", "running", "0/344","10-0-11-23", "running", "on"]})

# DataFrame.append was removed in pandas 2.0, so collect the rows in a list
# and build the DataFrame once at the end.
rows = []
new_row = {}
for _, row in df.iterrows():
    # A key that is already in the current row means a new record starts here.
    if row['col1'] in new_row:
        rows.append(new_row)
        new_row = {}
    new_row[row['col1']] = row['col2']
rows.append(new_row)  # flush the last record

# Keys missing from a record become NaN; columns fixes the column order.
new_df = pd.DataFrame(rows, columns=["ip", "state", "jobs", "status"])

print(new_df)
Output:

           ip    state   jobs status
0  10-0-11-99  running    NaN    NaN
1  10-0-11-19  running  0/344    NaN
2  10-0-11-23  running    NaN     on