Python pivot table while maintaining the row index

python pandas dataframe

I have the following DataFrame:

import pandas as pd

df = pd.DataFrame({'col1': ["ip", "state", "ip", "state", "jobs", "ip", "state", "status"],
                   'col2': ["10-0-11-99", "running","10-0-11-19", "running", "0/344","10-0-11-23", "running", "on"]})

     col1        col2
0      ip  10-0-11-99
1   state     running
2      ip  10-0-11-19
3   state     running
4    jobs       0/344
5      ip  10-0-11-23
6   state     running
7  status          on

I would like to convert it to the following format:

           ip    state   jobs status
0  10-0-11-99  running    NaN    NaN
1  10-0-11-19  running  0/344    NaN
2  10-0-11-23  running    NaN     on

I used the code below to convert it, but cumcount gives the wrong index, and I have run out of ideas for fixing it:

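import numpy as np

# Column labels in order of first appearance
cols = df.groupby(['col1'],sort=False).aggregate(np.sum).reset_index()['col1'].to_list()
# cumcount numbers each label's occurrences independently of the other labels
df['index'] = df.groupby(df['col1'])['col1'].cumcount()
df = df.pivot(index='index', columns='col1', values='col2')
df = df.reindex(cols, axis=1)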

col1    ip     state    jobs      status
index               
0   10-0-11-99  running 0/344      on
1   10-0-11-19  running NaN       NaN
2   10-0-11-23  running NaN       NaN
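
To see why the values end up in the wrong rows, it may help to compare what cumcount produces with the row each value is meant to land in. The sketch below assumes, as the expected output suggests, that every ip row starts a new record; the names occurrence and record are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'col1': ["ip", "state", "ip", "state", "jobs", "ip", "state", "status"],
                   'col2': ["10-0-11-99", "running", "10-0-11-19", "running", "0/344",
                            "10-0-11-23", "running", "on"]})

# cumcount numbers the occurrences of each label separately, so the first
# "jobs" and the first "status" both get 0, whichever record they belong to.
occurrence = df.groupby("col1").cumcount()

# Assumption: each "ip" row opens a new record, so counting the "ip" rows
# seen so far (minus one) gives the target row of every value.
record = df["col1"].eq("ip").cumsum() - 1

print(df.assign(occurrence=occurrence, record=record))
```

The two columns disagree exactly on the jobs and status rows, which are the values that were pulled up into row 0.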

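A key assumption is that ip determines how the data gets transposed and is always present. We can then flag the rows where col1 equals ip, take the cumulative sum of that flag to get a record number, build a MultiIndex, and unstack: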

res = df.assign(temp=df.col1.eq("ip").cumsum())
res

    col1    col2      temp
0   ip      10-0-11-99  1
1   state   running     1
2   ip      10-0-11-19  2
3   state   running     2
4   jobs    0/344       2
5   ip      10-0-11-23  3
6   state   running     3
7   status  on          3

Build the MultiIndex:

index = pd.MultiIndex.from_product([res.col1.unique(), 
                                    res.temp.unique()], 
                                    names=["col1", "temp"])

MultiIndex([(    'ip', 1),
            (    'ip', 2),
            (    'ip', 3),
            ( 'state', 1),
            ( 'state', 2),
            ( 'state', 3),
            (  'jobs', 1),
            (  'jobs', 2),
            (  'jobs', 3),
            ('status', 1),
            ('status', 2),
            ('status', 3)],
           names=['col1', 'temp'])

The final step is to set the index, reindex with index, and unstack:

res.set_index(["col1", "temp"]).reindex(index).unstack("col1")

                               col2
col1    ip      jobs    state   status
temp                
1   10-0-11-99  NaN     running NaN
2   10-0-11-19  0/344   running NaN
3   10-0-11-23  NaN     running on

You can reset the index afterwards, but that is trivial, so I did not bother.
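
The steps above can probably be condensed. The following is a minimal sketch, not the answer's original code: it swaps the MultiIndex and unstack steps for DataFrame.pivot, keeps the same assumption that ip is always present and starts each record, and also resets the index and restores the original column order. The names record_id, cols and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': ["ip", "state", "ip", "state", "jobs", "ip", "state", "status"],
                   'col2': ["10-0-11-99", "running", "10-0-11-19", "running", "0/344",
                            "10-0-11-23", "running", "on"]})

# Assumption: each "ip" row starts a new record, so the running count of
# "ip" rows serves as a record id.
record_id = df["col1"].eq("ip").cumsum()

# Labels in order of first appearance, used to undo pivot's alphabetical sort.
cols = df["col1"].unique()

out = (
    df.assign(temp=record_id)
      .pivot(index="temp", columns="col1", values="col2")  # missing cells become NaN
      .reindex(columns=cols)
      .reset_index(drop=True)
      .rename_axis(columns=None)
)
print(out)
```

Because pivot fills absent label/record combinations with NaN on its own, no explicit product index is needed in this variant.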

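Alternatively, you can abstract the index-building step with a library function that exposes the missing rows; at the moment, this requires installing that library's latest development version.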

I think the simplest and most direct approach is to loop over your df with some simple logic, so that we do not need to worry about indices:

import pandas as pd

df = pd.DataFrame({'col1': ["ip", "state", "ip", "state", "jobs", "ip", "state", "status"],
                   'col2': ["10-0-11-99", "running","10-0-11-19", "running", "0/344","10-0-11-23", "running", "on"]})

# DataFrame.append was removed in pandas 2.0, so collect the rows in a list
rows = []
new_row = {}
for index, row in df.iterrows():
    # A key that is already in the current row means a new record has started
    if row['col1'] in new_row:
        rows.append(new_row)
        new_row = {}

    new_row[row['col1']] = row['col2']
# Flush the last record
rows.append(new_row)

# Keys missing from a record are filled with NaN
new_df = pd.DataFrame(rows, columns=["ip", "state", "jobs", "status"])

print(new_df)

Output:

           ip    state   jobs status
0  10-0-11-99  running    NaN    NaN
1  10-0-11-19  running  0/344    NaN
2  10-0-11-23  running    NaN     on