Pivot a table in Python while preserving the row order
I have the following DataFrame:
df = pd.DataFrame({'col1': ["ip", "state", "ip", "state", "jobs", "ip", "state", "status"],
'col2': ["10-0-11-99", "running","10-0-11-19", "running", "0/344","10-0-11-23", "running", "on"]})
col1 col2
0 ip 10-0-11-99
1 state running
2 ip 10-0-11-19
3 state running
4 jobs 0/344
5 ip 10-0-11-23
6 state running
7 status on
I want to convert it to the following format:
ip state jobs status
0 10-0-11-99 running NaN NaN
1 10-0-11-19 running 0/344 NaN
2 10-0-11-23 running NaN on
I used the code below to convert it, but cumcount gives the wrong index, and I have run out of ideas for fixing it:
cols = df.groupby(['col1'],sort=False).aggregate(np.sum).reset_index()['col1'].to_list()
df['index'] = df.groupby(df['col1'])['col1'].cumcount()
df = df.pivot(index='index', columns='col1', values='col2')
df = df.reindex(cols, axis=1)
col1 ip state jobs status
index
0 10-0-11-99 running 0/344 on
1 10-0-11-19 running NaN NaN
2 10-0-11-23 running NaN NaN
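A key assumption is that ip determines how the data is transposed and is always present. Under that assumption, we can flag the rows where col1 is ip, take the cumulative sum of that flag to number the records, build a MultiIndex, and then unstack: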
res = df.assign(temp=df.col1.eq("ip").cumsum())
res
col1 col2 temp
0 ip 10-0-11-99 1
1 state running 1
2 ip 10-0-11-19 2
3 state running 2
4 jobs 0/344 2
5 ip 10-0-11-23 3
6 state running 3
7 status on 3
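To see why cumcount misaligns the rows while the cumulative sum does not, the two labelings can be compared side by side. This is a minimal sketch on the question's data; the names by_cumcount, by_cumsum and comparison are illustrative and do not come from the original posts.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["ip", "state", "ip", "state", "jobs", "ip", "state", "status"],
    "col2": ["10-0-11-99", "running", "10-0-11-19", "running", "0/344",
             "10-0-11-23", "running", "on"],
})

# cumcount numbers the occurrences of each key independently, so the first
# "jobs" and the first "status" both get 0, whichever record they belong to.
by_cumcount = df.groupby("col1").cumcount()

# The cumulative sum of the "ip" flag only increases when a new record starts,
# so every row inherits the number of the nearest "ip" row above it.
by_cumsum = df["col1"].eq("ip").cumsum()

comparison = df.assign(cumcount=by_cumcount, record=by_cumsum)
print(comparison)
```

In the cumcount column, jobs and status land in group 0 next to the first ip, which is the misalignment shown in the question; in the record column they stay with the ip that precedes them.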
Build the MultiIndex:
index = pd.MultiIndex.from_product([res.col1.unique(),
res.temp.unique()],
names=["col1", "temp"])
MultiIndex([( 'ip', 1),
( 'ip', 2),
( 'ip', 3),
( 'state', 1),
( 'state', 2),
( 'state', 3),
( 'jobs', 1),
( 'jobs', 2),
( 'jobs', 3),
('status', 1),
('status', 2),
('status', 3)],
names=['col1', 'temp'])
The final step is to set the index, reindex with the MultiIndex built above, and unstack:
res.set_index(["col1", "temp"]).reindex(index).unstack("col1")
col2
col1 ip jobs state status
temp
1 10-0-11-99 NaN running NaN
2 10-0-11-19 0/344 running NaN
3 10-0-11-23 NaN running on
You can reset the index afterwards, but that is trivial, so I did not bother with it here.
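Resetting the index and restoring the original column order can be folded into a single chain. The sketch below is an alternative, not the code from the answer above: it swaps in pivot for the set_index/reindex/unstack sequence and assumes each key appears at most once between consecutive ip rows (otherwise pivot raises an error). The names record, cols and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["ip", "state", "ip", "state", "jobs", "ip", "state", "status"],
    "col2": ["10-0-11-99", "running", "10-0-11-19", "running", "0/344",
             "10-0-11-23", "running", "on"],
})

# Keys in order of first appearance, used to restore the column order.
cols = df["col1"].unique()

out = (
    # Number the records: a new record starts at every "ip" row.
    df.assign(record=df["col1"].eq("ip").cumsum())
      # One row per record, one column per key; missing keys become NaN.
      .pivot(index="record", columns="col1", values="col2")
      # pivot sorts the columns alphabetically; put them back in source order.
      .reindex(columns=cols)
      # Replace the 1-based record numbers with a plain 0-based index.
      .reset_index(drop=True)
      # Drop the leftover "col1" label on the column axis.
      .rename_axis(columns=None)
)
print(out)
```

Under those assumptions, the result has the columns ip, state, jobs, status in that order, one row per ip, as requested in the question.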
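Alternatively, a helper function can abstract the index-building step that exposes the missing rows; at the time of writing, using it required installing the latest development version of the library that provides it.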
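I think the simplest and most straightforward approach is to loop over your df with some simple logic, so that we do not need to worry about the index at all: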
import pandas as pd
df = pd.DataFrame({'col1': ["ip", "state", "ip", "state", "jobs", "ip", "state", "status"],
                   'col2': ["10-0-11-99", "running","10-0-11-19", "running", "0/344","10-0-11-23", "running", "on"]})
rows = []     # completed records
new_row = {}  # record currently being filled
for index, row in df.iterrows():
    # A repeated key means a new record has started: store the current one.
    if row['col1'] in new_row:
        rows.append(new_row)
        new_row = {}
    new_row[row['col1']] = row['col2']
# Store the last record, which the loop itself never flushes.
rows.append(new_row)
# DataFrame.append was removed in pandas 2.0, so build the frame once from the list.
new_df = pd.DataFrame(rows, columns=["ip", "state", "jobs", "status"])
print(new_df)
Output:
ip state jobs status
0 10-0-11-99 running NaN NaN
1 10-0-11-19 running 0/344 NaN
2 10-0-11-23 running NaN on