Transforming a dataset in Python to display mean and standard deviation
Tags: python, pandas, numpy
I have a dataset, df, and I want to show a comparison of the mean and standard deviation between groups while keeping the id, start, end, and value columns in the output.
id start end value
a 5/1/2020 6/1/2020 2
a 6/1/2020 7/1/2020 3
a 7/1/2020 8/1/2020 4
a 8/1/2020 9/1/2020 20
b 5/1/2020 6/1/2020 15
b 6/1/2020 7/1/2020 2
b 7/1/2020 8/1/2020 1
Desired output
id start end value mean stdev 1SD 2SD 3SD
a 5/1/2020 6/1/2020 2
a 6/1/2020 7/1/2020 3
a 7/1/2020 8/1/2020 4
a 8/1/2020 9/1/2020 20 29 8.539126 20.5 12 -5.5
b 5/1/2020 6/1/2020 15
b 6/1/2020 7/1/2020 2
b 7/1/2020 8/1/2020 1 18 7.81025 10.2 2.4 -5.4
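The desired output groups the value column by id and finds the mean, the standard deviation, and the 1SD, 2SD, and 3SD columns.
This is what I am doing: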
import pandas as pd

df = pd.read_csv("data.csv")
# One row per id, holding the mean and standard deviation of value
output = df.groupby('id')['value'].agg(['mean', 'std'])
output['1SD'] = output['mean'] - output['std']
output['2SD'] = output['mean'] - 2 * output['std']
output['3SD'] = output['mean'] - 3 * output['std']
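However, I am unable to keep the start, end, and value columns. I am still researching this, and any suggestions are appreciated.
Use the following: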
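One way to keep every original column is to broadcast the group statistics back onto each row with GroupBy.transform, the same call the masked code further down relies on. The following is only a minimal sketch under assumptions: the DataFrame is built inline from the sample rows in the question instead of being read from data.csv, the dates are left as plain strings, and the loop over k is just a compact way to write the three SD columns.

```python
import pandas as pd

# Sample rows copied from the question (assumption: built inline, dates kept as strings).
df = pd.DataFrame({
    "id": ["a", "a", "a", "a", "b", "b", "b"],
    "start": ["5/1/2020", "6/1/2020", "7/1/2020", "8/1/2020",
              "5/1/2020", "6/1/2020", "7/1/2020"],
    "end": ["6/1/2020", "7/1/2020", "8/1/2020", "9/1/2020",
            "6/1/2020", "7/1/2020", "8/1/2020"],
    "value": [2, 3, 4, 20, 15, 2, 1],
})

output = df.copy()
g = output.groupby("id")["value"]

# transform returns a result with the same index as the input,
# so every row receives the statistic of its own group.
output["mean"] = g.transform("mean")
output["std"] = g.transform("std")

# mean minus 1, 2 and 3 standard deviations
for k in (1, 2, 3):
    output[f"{k}SD"] = output["mean"] - k * output["std"]

print(output)
```

Unlike agg, which collapses each group to a single row, this leaves all seven rows and the id, start, end, and value columns in place, with the group statistics repeated on every row of the group.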
If you need the statistics on only one row per group, assign them only to the last row of each group:
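# Assumption: output starts as a copy of the original frame, so id, start, end and value are kept
output = df.copy()
# True only on the last row of each id group
mask = ~df['id'].duplicated(keep='last')
g = output.groupby('id')['value']
# Assign the group statistics to the last row of each group; the other rows stay NaN
output.loc[mask, 'mean'] = g.transform('mean')
output.loc[mask, 'std'] = g.transform('std')
output['1SD'] = output['mean'] - output['std']
output['2SD'] = output['mean'] - 2 * output['std']
output['3SD'] = output['mean'] - 3 * output['std']
print(output)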
id start end value mean std 1SD 2SD 3SD
0 a 5/1/2020 6/1/2020 2 NaN NaN NaN NaN NaN
1 a 6/1/2020 7/1/2020 3 NaN NaN NaN NaN NaN
2 a 7/1/2020 8/1/2020 4 NaN NaN NaN NaN NaN
3 a 8/1/2020 9/1/2020 20 7.25 8.539126 -1.289126 -9.828251 -18.367377
4 b 5/1/2020 6/1/2020 15 NaN NaN NaN NaN NaN
5 b 6/1/2020 7/1/2020 2 NaN NaN NaN NaN NaN
6 b 7/1/2020 8/1/2020 1 6.00 7.810250 -1.810250 -9.620499 -17.430749
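The std column above is the sample standard deviation (divisor n - 1), which is what pandas computes by default, whereas NumPy's np.std defaults to the population form (divisor n). A small check on group a, offered as a sketch with variable names chosen only for illustration:

```python
import numpy as np
import pandas as pd

a_values = [2, 3, 4, 20]  # value column for id "a" in the question

sample_std = np.std(a_values, ddof=1)      # divides by n - 1
population_std = np.std(a_values, ddof=0)  # divides by n (NumPy's default)
pandas_std = pd.Series(a_values).std()     # pandas defaults to ddof=1

print(round(sample_std, 6))      # 8.539126, matching the table
print(round(pandas_std, 6))      # same value as sample_std
print(round(population_std, 6))  # smaller, because the divisor is larger
```

If the population figure is what you want, passing ddof=0 to the pandas std call gives it.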
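Hi @jezrael, I will try this. Can you explain what "mask" does?
@Lynnette - it assigns the values only to the last row of each group. If you omit it, the values are repeated on every row of each group; in other words, there are no NaNs like those in the second solution.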
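To make the role of mask concrete, here is a small sketch on the id column alone. It assumes the same seven sample rows and uses Series.where as a stand-in for the masked .loc assignment in the answer; the effect on the column is the same, but it is not the answer's exact code.

```python
import pandas as pd

ids = pd.Series(["a", "a", "a", "a", "b", "b", "b"], name="id")
values = pd.Series([2, 3, 4, 20, 15, 2, 1], name="value")

# duplicated(keep='last') flags every occurrence of an id except its last one,
# so negating it leaves True only on the final row of each group.
mask = ~ids.duplicated(keep="last")
print(mask.tolist())  # [False, False, False, True, False, False, True]

# Without the mask, transform repeats the group mean on every row.
group_mean = values.groupby(ids).transform("mean")
print(group_mean.tolist())  # [7.25, 7.25, 7.25, 7.25, 6.0, 6.0, 6.0]

# With the mask, only the last row of each group keeps the value; the rest are NaN.
masked_mean = group_mean.where(mask)
print(masked_mean.tolist())
```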