Transposing and combining data frames in Python

python python-3.x pandas dataframe statistics

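Suppose I have data in this format

option,subcase,prop1,prop2,prop3,...

stored as a .csv file.

Now I want to compute statistics for each option, plus separate statistics for each subcase.

If I just wanted to print everything out, and were not interested in confidence intervals, it might look something like this: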

import numpy as np
import pandas as pd
import sys

df = pd.read_csv(sys.argv[1])  # note to self: argv[0] is the script path, argv[1] the CSV file

options = df.option.unique()
option_data = {}

subcases = df.subcase.unique()
data = {}

for o in options:
    # Rows for this option (exact match rather than a substring test)
    option_data[o] = df[df.option == o]
    print(o)
    print(option_data[o].describe())

    for s in subcases:
        label = o + '_' + s
        # Rows for this option and this subcase
        data[label] = option_data[o][option_data[o].subcase == s]
        print(label)
        print(data[label].describe())

    print()

However, this is hard to read.

What is the best way to combine the data frames so that I end up with frames like the following?

prop1    mean    std    min    25%    ...
A
A_a
A_b
A_c
B
B_a
B_c
...

prop2    mean    std    min    25%    ...
A
A_a
A_b
A_c
B
B_a
B_c
...

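I mean, I could loop over all the frames manually... but there must be a more efficient way.

Edit

For example, the sample input should produce two frames, cost and time, where the entries of the A and B rows are computed from the entries of all their corresponding subcases.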

Use pd.concat:

df1=df.groupby('option').cost.describe()
df2=df.groupby(['option','subcase']).cost.describe()

df2.index=df2.index.map('_'.join)
pd.concat([df1,df2]).sort_index()


Out[256]: 
        count  mean       std   min    25%   50%    75%   max
A         4.0   2.5  1.290994   1.0   1.75   2.5   3.25   4.0
A_sub1    2.0   3.0  1.414214   2.0   2.50   3.0   3.50   4.0
A_sub2    2.0   2.0  1.414214   1.0   1.50   2.0   2.50   3.0
B         4.0   8.0  4.760952   3.0   4.50   8.0  11.50  13.0
B_sub1    2.0  12.0  1.414214  11.0  11.50  12.0  12.50  13.0
B_sub2    2.0   4.0  1.414214   3.0   3.50   4.0   4.50   5.0
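
The same pattern can be repeated for every property column to get one frame per property, as the question asks. Below is a minimal sketch. The sample frame is an assumption: its values were chosen so that the summary statistics agree with the output shown above, but the pairing of cost and time values within a row is a guess, and describe_with_totals is a hypothetical helper name.

```python
import pandas as pd

# Illustrative data only (assumed, not the asker's actual file).
df = pd.DataFrame({
    "option":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "subcase": ["sub1", "sub1", "sub2", "sub2", "sub1", "sub1", "sub2", "sub2"],
    "cost":    [2, 4, 1, 3, 11, 13, 3, 5],
    "time":    [0, 3, 2, 8, 0, 0, 2, 4],
})


def describe_with_totals(frame, prop):
    """Hypothetical helper: describe() of one column per option and per option_subcase."""
    per_option = frame.groupby("option")[prop].describe()
    per_subcase = frame.groupby(["option", "subcase"])[prop].describe()
    # Flatten the (option, subcase) MultiIndex into labels such as "A_sub1".
    per_subcase.index = per_subcase.index.map("_".join)
    # Stack both summaries and sort so each option is followed by its subcases.
    return pd.concat([per_option, per_subcase]).sort_index()


# One summary frame per property column.
frames = {prop: describe_with_totals(df, prop) for prop in ["cost", "time"]}

print(frames["cost"])
print(frames["time"])
```

Keeping the frames in a dict keyed by property name avoids writing the groupby twice for each column; any other container would work just as well.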

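Can you provide a small reproducible sample data set and the corresponding desired data set?

@MaxU Done, please see the edit above.

I like this format. Is there a way to add a "total" subcategory to A and B?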

Desired time frame (its values match the time statistics in the outputs):

,mean,std,min,25%,50%,75%,max
A,3.25,3.40343,0,1.5,2.5,4.25,8
A_sub1,1.5,2.12132,0,0.75,1.5,2.25,3
A_sub2,5,4.242641,2,3.5,5,6.5,8
B,1.5,1.914854,0,0,1,2.5,4
B_sub1,0,0,0,0,0,0,0
B_sub2,3,1.414214,2,2.5,3,3.5,4

In [79]: df.groupby(['option','subcase']).describe()
Out[79]:
                cost                                                time
               count  mean       std   min   25%   50%   75%   max count mean       std  min   25%  50%   75%  max
option subcase
A      sub1      2.0   3.0  1.414214   2.0   2.5   3.0   3.5   4.0   2.0  1.5  2.121320  0.0  0.75  1.5  2.25  3.0
       sub2      2.0   2.0  1.414214   1.0   1.5   2.0   2.5   3.0   2.0  5.0  4.242641  2.0  3.50  5.0  6.50  8.0
B      sub1      2.0  12.0  1.414214  11.0  11.5  12.0  12.5  13.0   2.0  0.0  0.000000  0.0  0.00  0.0  0.00  0.0
       sub2      2.0   4.0  1.414214   3.0   3.5   4.0   4.5   5.0   2.0  3.0  1.414214  2.0  2.50  3.0  3.50  4.0

Update:

In [97]: r = df.groupby(['option','subcase']).describe()

In [100]: t = df.groupby('option').describe().set_index(np.array([''] * df['option'].nunique()), append=True)

In [101]: r.append(t).sort_index()  # pandas >= 2.0 removed DataFrame.append; use pd.concat([r, t]).sort_index() there
Out[101]:
                cost                                                  time
               count  mean       std   min    25%   50%    75%   max count  mean       std  min   25%  50%   75%  max
option subcase
A                4.0   2.5  1.290994   1.0   1.75   2.5   3.25   4.0   4.0  3.25  3.403430  0.0  1.50  2.5  4.25  8.0
       sub1      2.0   3.0  1.414214   2.0   2.50   3.0   3.50   4.0   2.0  1.50  2.121320  0.0  0.75  1.5  2.25  3.0
       sub2      2.0   2.0  1.414214   1.0   1.50   2.0   2.50   3.0   2.0  5.00  4.242641  2.0  3.50  5.0  6.50  8.0
B                4.0   8.0  4.760952   3.0   4.50   8.0  11.50  13.0   4.0  1.50  1.914854  0.0  0.00  1.0  2.50  4.0
       sub1      2.0  12.0  1.414214  11.0  11.50  12.0  12.50  13.0   2.0  0.00  0.000000  0.0  0.00  0.0  0.00  0.0
       sub2      2.0   4.0  1.414214   3.0   3.50   4.0   4.50   5.0   2.0  3.00  1.414214  2.0  2.50  3.0  3.50  4.0
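
On pandas 2.0 and later, DataFrame.append is no longer available, so the combined frame has to be built with pd.concat. The following is a rough sketch of that variant, not the answerer's original code. The sample data is assumed (values chosen to agree with the statistics printed above; the row pairing of cost and time is a guess), and the empty-string subcase label for the per-option rows follows the approach in the update.

```python
import pandas as pd

# Illustrative data only (assumed, not the asker's actual file).
df = pd.DataFrame({
    "option":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "subcase": ["sub1", "sub1", "sub2", "sub2", "sub1", "sub1", "sub2", "sub2"],
    "cost":    [2, 4, 1, 3, 11, 13, 3, 5],
    "time":    [0, 3, 2, 8, 0, 0, 2, 4],
})

# Statistics for every (option, subcase) pair, all numeric columns at once.
per_subcase = df.groupby(["option", "subcase"]).describe()

# Statistics per option; give these rows an empty subcase label so that
# both frames share the same two-level (option, subcase) index.
per_option = df.groupby("option").describe()
per_option.index = pd.MultiIndex.from_arrays(
    [per_option.index, [""] * len(per_option)],
    names=["option", "subcase"],
)

# The empty label sorts before "sub1", so each option's total row comes first.
combined = pd.concat([per_subcase, per_option]).sort_index()

# Selecting a top-level column yields the per-property frame.
cost_frame = combined["cost"]
time_frame = combined["time"]

print(cost_frame)
print(time_frame)
```

Selecting combined["cost"] or combined["time"] gives the one-frame-per-property layout from the question while keeping option and subcase as separate index levels.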