Python: generate descriptive statistics for each row value and transpose dynamically
I have a dataframe like the one shown below:
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1,1,1,1,2,2,2,2,3,3,4,4,4,4,4],
    'readings': ['READ_1','READ_2','READ_1','READ_3','READ_1','READ_5','READ_6','READ_8','READ_10','READ_12','READ_11','READ_14','READ_09','READ_08','READ_07'],
    'val': [5,6,7,11,5,7,16,12,13,56,32,13,45,43,46],
})
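What I would like to do is get the descriptive statistics (summary form) of the existing column instead of the raw column. I would like to see min, max, 25%, 75%, std and var as new columns for each subject.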
I tried the following, but the output is not accurate:
df.groupby(['subject_id','readings']).describe().reset_index() #this gives some output but it isn't exact
df.groupby(['subject_id','readings']).pivot_table(values='val', index='subject_id', columns='readings').describe() # this throws error
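A note on the second attempt: pivot_table is a method of DataFrame, not of the object returned by groupby, which is why that line raises an error. A minimal sketch of calling it on the dataframe directly, assuming only min and max are wanted (the name wide is illustrative and not part of the question):

```python
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1,1,1,1,2,2,2,2,3,3,4,4,4,4,4],
    'readings': ['READ_1','READ_2','READ_1','READ_3','READ_1','READ_5','READ_6','READ_8',
                 'READ_10','READ_12','READ_11','READ_14','READ_09','READ_08','READ_07'],
    'val': [5,6,7,11,5,7,16,12,13,56,32,13,45,43,46],
})

# pivot_table groups by index/columns itself, so no groupby call is needed
wide = df.pivot_table(values='val', index='subject_id',
                      columns='readings', aggfunc=['min', 'max'])
print(wide['max']['READ_1'])
```

This gives one row per subject, with the statistic name as the outer column level and the reading as the inner one.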
I expect my output to look like the screenshot (not reproduced here). Basically, it will be a wide and sparse matrix.
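Comments:
Thanks, I'll try it. You make it look so simple. Great, upvoted.
@SSMK - the solution has changed, please test the latest version.
I am currently applying this solution to a dataset with 4779657 rows and 26 columns. It has been running for more than half an hour. Is there any other way to speed this up?
@SSMK - I checked both questions, but they do not apply to a dask implementation. The reason is that this is a kind of pivoting, which is usually a very complex operation and not easy to parallelize.
Is there any way to apply this to such a large dataset?

Use DataFrameGroupBy.describe with unstack for the reshape, then swaplevel with reindex to restore the order in which the readings appear in the original data: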
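# summarise val per (subject_id, readings) pair, then pivot readings into columns
df = (df.groupby(['subject_id','readings'])['val']
        .describe()
        .unstack()
        .swaplevel(0, 1, axis=1)
        # restore the order in which the readings first appear in the data
        .reindex(df['readings'].unique(), axis=1, level=0))
# flatten the (reading, statistic) MultiIndex into names such as READ_1_mean
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print(df)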
subject_id READ_1_count READ_1_mean READ_1_std READ_1_min READ_1_25% \
0 1 2.0 6.0 1.414214 5.0 5.5
1 2 1.0 5.0 NaN 5.0 5.0
2 3 NaN NaN NaN NaN NaN
3 4 NaN NaN NaN NaN NaN
READ_1_50% READ_1_75% READ_1_max READ_2_count ... READ_08_75% \
0 6.0 6.5 7.0 1.0 ... NaN
1 5.0 5.0 5.0 NaN ... NaN
2 NaN NaN NaN NaN ... NaN
3 NaN NaN NaN NaN ... 43.0
READ_08_max READ_07_count READ_07_mean READ_07_std READ_07_min \
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 43.0 1.0 46.0 NaN 46.0
READ_07_25% READ_07_50% READ_07_75% READ_07_max
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 46.0 46.0 46.0 46.0
[4 rows x 105 columns]
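The question also asks for var, which describe() does not report, and the comments ask about run time. The following is a minimal sketch, assuming only the six requested statistics are needed. It uses the built-in groupby aggregations and groupby quantile instead of describe(), which may reduce run time on large inputs; this has not been benchmarked on the dataset mentioned in the comments. The names g, basic, quart, stats and wide are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1,1,1,1,2,2,2,2,3,3,4,4,4,4,4],
    'readings': ['READ_1','READ_2','READ_1','READ_3','READ_1','READ_5','READ_6','READ_8',
                 'READ_10','READ_12','READ_11','READ_14','READ_09','READ_08','READ_07'],
    'val': [5,6,7,11,5,7,16,12,13,56,32,13,45,43,46],
})

g = df.groupby(['subject_id', 'readings'])['val']

# built-in aggregations, referenced by name
basic = g.agg(['min', 'max', 'std', 'var'])

# both quartiles in one call; the quantile level becomes the last index level
quart = g.quantile([0.25, 0.75]).unstack()
quart.columns = ['25%', '75%']

# combine and put the statistics in the order requested in the question
stats = basic.join(quart)[['min', 'max', '25%', '75%', 'std', 'var']]

# same reshape as in the answer: readings to columns, original reading order
wide = (stats.unstack('readings')
             .swaplevel(0, 1, axis=1)
             .reindex(df['readings'].unique(), axis=1, level=0))
wide.columns = wide.columns.map('_'.join)
wide = wide.reset_index()
print(wide.shape)
```

As in the answer above, groups with a single observation get NaN for std and var, and readings a subject never had are NaN throughout.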