Transforming a dataset in Python to display mean and st. dev.


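I have a dataset, df, and I would like to show a comparison of the mean and standard deviation across groups while keeping the id, start, end and value columns in the output.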

id  start       end         value

a   5/1/2020    6/1/2020    2
a   6/1/2020    7/1/2020    3
a   7/1/2020    8/1/2020    4
a   8/1/2020    9/1/2020    20
b   5/1/2020    6/1/2020    15
b   6/1/2020    7/1/2020    2
b   7/1/2020    8/1/2020    1
Expected output:

id  start       end         value   mean    stdev       1SD       2SD     3SD

a   5/1/2020    6/1/2020    2                   
a   6/1/2020    7/1/2020    3                   
a   7/1/2020    8/1/2020    4                   
a   8/1/2020    9/1/2020    20      29      8.539126    20.5      12     -5.5
b   5/1/2020    6/1/2020    15                  
b   6/1/2020    7/1/2020    2                   
b   7/1/2020    8/1/2020    1       18      7.81025     10.2      2.4    -5.4
The desired output groups by id and value, and finds the mean, the standard deviation, and 1SD, 2SD and 3SD.

This is what I am doing:

import pandas as pd

df = pd.read_csv("data.csv")

# Aggregating per id leaves only one row per group
output = df.groupby('id')['value'].agg(['mean', 'std'])
output['1SD'] = output['mean'] - output['std']
output['2SD'] = output['mean'] - 2 * output['std']
output['3SD'] = output['mean'] - 3 * output['std']
However, I am unable to keep the start, end and value columns. I am still researching this, and any suggestions are appreciated.

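Use groupby with transform here:

If you only need one row of statistics per group, assign the values only to the last row of each group: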

# Work on a copy of the original rows so id, start, end and value are kept
output = df.copy()

# True only on the last row of each id
mask = ~output['id'].duplicated(keep='last')
g = output.groupby('id')['value']
output.loc[mask, 'mean'] = g.transform('mean')
output.loc[mask, 'std'] = g.transform('std')

output['1SD'] = output['mean'] - output['std']
output['2SD'] = output['mean'] - 2 * output['std']
output['3SD'] = output['mean'] - 3 * output['std']

print(output)
  id     start       end  value  mean       std       1SD       2SD        3SD
0  a  5/1/2020  6/1/2020      2   NaN       NaN       NaN       NaN        NaN
1  a  6/1/2020  7/1/2020      3   NaN       NaN       NaN       NaN        NaN
2  a  7/1/2020  8/1/2020      4   NaN       NaN       NaN       NaN        NaN
3  a  8/1/2020  9/1/2020     20  7.25  8.539126 -1.289126 -9.828251 -18.367377
4  b  5/1/2020  6/1/2020     15   NaN       NaN       NaN       NaN        NaN
5  b  6/1/2020  7/1/2020      2   NaN       NaN       NaN       NaN        NaN
6  b  7/1/2020  8/1/2020      1  6.00  7.810250 -1.810250 -9.620499 -17.430749
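
An alternative, not taken from the answer above, is to keep the question's aggregated frame and join it back onto the original rows. The sketch below assumes the sample data from the question is built inline instead of being read from data.csv. The name stats is only illustrative. This variant fills the statistics on every row instead of only the last row of each group.

```python
import pandas as pd

# Sample data copied from the question (stands in for data.csv).
df = pd.DataFrame({
    "id": ["a", "a", "a", "a", "b", "b", "b"],
    "start": ["5/1/2020", "6/1/2020", "7/1/2020", "8/1/2020",
              "5/1/2020", "6/1/2020", "7/1/2020"],
    "end": ["6/1/2020", "7/1/2020", "8/1/2020", "9/1/2020",
            "6/1/2020", "7/1/2020", "8/1/2020"],
    "value": [2, 3, 4, 20, 15, 2, 1],
})

# One row of statistics per id, as in the question's attempt.
stats = df.groupby("id")["value"].agg(["mean", "std"])
for k in (1, 2, 3):
    stats[f"{k}SD"] = stats["mean"] - k * stats["std"]

# Join the per-group statistics back onto every original row:
# the "id" column of df is matched against the index of stats.
merged = df.join(stats, on="id")
print(merged)
```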

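Hi @jezrael, I will try this. Could you explain what 'mask' does?
@Lynnette - it assigns the values only to the last row of each group. If you omit it, the values are repeated on every row of each group; in other words, there are no NaNs as in the second solution.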
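
To make the effect of the mask concrete, here is a small sketch that assumes only the id and value columns from the question's data. The column names mean_all and mean_last are made up for this illustration.

```python
import pandas as pd

# Only the id and value columns from the question's data are needed here.
df = pd.DataFrame({"id": list("aaaabbb"), "value": [2, 3, 4, 20, 15, 2, 1]})

# duplicated(keep='last') flags every row except the last occurrence of each id;
# ~ inverts that, so the mask is True only on the last row of each group.
mask = ~df["id"].duplicated(keep="last")
print(mask.tolist())  # [False, False, False, True, False, False, True]

# transform returns one value per original row, aligned to df's index.
means = df.groupby("id")["value"].transform("mean")

# Without the mask, every row of a group receives the group mean.
df["mean_all"] = means

# With the mask, only the last row of each group is filled; the rest stay NaN.
df.loc[mask, "mean_last"] = means
print(df)
```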