Python: plotting a DataFrame as boxplots in chunks of values

python pandas

I have a DataFrame with a single value column ('Time') plus a label column ('EDGE'), built as follows:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(20, 1), columns=['Time'])
df['EDGE'] = pd.Series(['A', 'A', 'A','B', 'B', 'A', 'B','C','C', 'B','D','A','E','F','F','A','G','H','H','A'])
df

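The real DataFrame has a few hundred thousand rows, and there are about 200 unique 'EDGE' values.

I would like to plot the results as boxplots, like this: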

boxplot = df.boxplot(by='EDGE')

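With that many values I have to split the plot up, showing, say, 10 edges per figure. In addition, I would like the edges with the larger mean Time to be plotted first.

Expected result: a set of boxplots, each containing 10 edges, with the boxes shown in descending order of mean 'Time'.

How should I proceed?

What I tried

I tried using loc to build a sub_df for each value, but then I only get one box per boxplot. I also tried using groupby to group by 'EDGE', but got nowhere, because I don't know how to plot only the first n groups of a DataFrame.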

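Note: I would like to use as few libraries as possible. That is, I would rather do this with pandas alone than with matplotlib, and with matplotlib rather than with another library built on top of it.

You can do this by reshaping the DataFrame: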

# define the number of edges per plot
nb_edges_per_plot = 4 #to change to your needs

# group by edge
gr = df.groupby('EDGE')['Time']
# get the mean per group and sort them 
order_ = gr.mean().sort_values(ascending=False).index
print (order_) #order depends on the random value so probably not same for you
#Index(['D', 'H', 'C', 'B', 'A', 'E', 'G', 'F'], dtype='object', name='EDGE')

# reshape the dataframe to make each EDGE a column, then order the columns by mean
df_ = df.set_index(['EDGE', gr.cumcount()])['Time'].unstack(0)[order_]
print (df_.iloc[:5, :5])
# EDGE         D         H         C         B         A
# 0     1.729417  0.270593 -0.140786 -0.540270  0.862832
# 1          NaN  0.647830  1.038952 -0.129361 -0.648432
# 2          NaN       NaN       NaN -1.235637 -0.430890
# 3          NaN       NaN       NaN  0.631744 -1.622461
# 4          NaN       NaN       NaN       NaN  0.694052
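
To see what the reshape does, here is a minimal sketch on a tiny hand-made frame (the values and the names small and wide are made up for illustration, they are not from the question). cumcount numbers the rows inside each group, so that unstack can spread every EDGE into its own column, padding shorter groups with NaN:

```python
import pandas as pd

# Tiny illustrative frame: A has 2 rows (mean 2), B has 3 rows (mean 20)
small = pd.DataFrame({
    "Time": [1.0, 3.0, 10.0, 20.0, 30.0],
    "EDGE": ["A", "A", "B", "B", "B"],
})

gr = small.groupby("EDGE")["Time"]

# Edges sorted by descending mean: B comes before A
order_ = gr.mean().sort_values(ascending=False).index

# cumcount gives A -> 0, 1 and B -> 0, 1, 2; it becomes the row index,
# while EDGE is moved to the columns by unstack(0)
wide = small.set_index(["EDGE", gr.cumcount()])["Time"].unstack(0)[order_]
print(wide)
```

The result has one column per edge (B first, then A) and three rows, with a single NaN at the bottom of column A because that group is shorter.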

Now you can use groupby together with boxplot. To plot each group of edges on its own subplot, do the following:

df_.groupby(np.arange(len(order_))//nb_edges_per_plot, axis=1).boxplot()
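
As far as I know, passing axis=1 to groupby is deprecated in recent pandas releases (2.1 and later), so the line above may warn or fail depending on your version. A possible alternative, sketched here under that assumption, is to slice the ordered columns into chunks with iloc and draw each chunk on its own matplotlib subplot. The names chunks, fig and axes are mine, and the Agg backend line is only there so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to show the figure on screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(20, 1), columns=["Time"])
df["EDGE"] = list("AAABBABCCBDAEFFAGHHA")  # same labels as in the question

nb_edges_per_plot = 4  # change to your needs

# Same reshape as above: one column per EDGE, ordered by descending mean
gr = df.groupby("EDGE")["Time"]
order_ = gr.mean().sort_values(ascending=False).index
df_ = df.set_index(["EDGE", gr.cumcount()])["Time"].unstack(0)[order_]

# Consecutive column chunks of at most nb_edges_per_plot edges each
chunks = [df_.iloc[:, i:i + nb_edges_per_plot]
          for i in range(0, df_.shape[1], nb_edges_per_plot)]

# One subplot per chunk, sharing the y axis so the boxes are comparable
fig, axes = plt.subplots(1, len(chunks), sharey=True, squeeze=False)
for ax, chunk in zip(axes[0], chunks):
    chunk.boxplot(ax=ax)

fig.savefig("boxplots.png")  # hypothetical output file name
```

With 200 edges and 10 edges per plot this gives 20 subplots, so in practice you may prefer one figure per chunk (create the figure inside the loop) rather than a single wide row.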

Or, if you want separate figures, you can do this:

for _, dfg_ in df_.groupby(np.arange(len(order_))//nb_edges_per_plot, axis=1):
    dfg_.plot(kind='box')

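Or you can get separate figures in a single line; the difference is that you use plot.box() instead of boxplot(). Note that the loop version is more flexible if you want to change parameters for each plot.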

df_.groupby(np.arange(len(order_))//nb_edges_per_plot, axis=1).plot.box()

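You can create an intermediate frame, groups, that assigns each edge to a plot number (column order) and to a position within that plot (column position).

For example: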

import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.randn(20, 1), columns=['Time'])
df['EDGE'] = pd.Series(['A', 'A', 'A','B', 'B', 'A', 'B','C','C', 'B','D','A','E','F','F','A','G','H','H','A'])

# code from above ...

#verification:
print(df.groupby('EDGE').Time.mean().sort_values(ascending=False))
#EDGE
#G    1.494079
#B    1.367285
#E    0.761038
#A    0.442789
#F    0.282769
#D    0.144044
#H    0.053955
#C   -0.127288
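
The code that builds the intermediate frame is not shown here, so the following is only a sketch of how such a frame might be built from the description above. The names groups, order and position follow that description, while means, rank, merged and the chunk size of 4 are my own assumptions:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(20, 1), columns=["Time"])
df["EDGE"] = list("AAABBABCCBDAEFFAGHHA")  # same labels as in the question

nb_edges_per_plot = 4  # assumed chunk size, change to your needs

# Mean Time per edge, largest first
means = df.groupby("EDGE")["Time"].mean().sort_values(ascending=False)
rank = np.arange(len(means))  # 0 for the edge with the largest mean

# Intermediate frame: one row per edge
groups = pd.DataFrame({
    "order": rank // nb_edges_per_plot,    # which plot the edge belongs to
    "position": rank % nb_edges_per_plot,  # slot of the edge inside that plot
}, index=means.index)

# Attach plot number and position to every observation
merged = df.join(groups, on="EDGE")
print(groups)
```

Each value of order in merged then selects the rows for one plot, for instance with merged.groupby("order"), and position gives the left-to-right order of the boxes inside it.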
