Python: pandas distribution table grouped by all columns

python python-2.7 pandas dataframe

I have a pandas DataFrame like the one below, and I want to aggregate it to get the distribution of each unique record:

      col1   col2    col3  
0       1      3       0  
1       1      2       0  
2       1      2       0  
3       1      5       1  
4       1      3       1  
5       1      5       0

I would like to end up with a DataFrame like this:

       col1   col2    col3   distribution
0       1      3       0         0.166
1       1      3       1         0.166
2       1      2       0         0.333
3       1      5       1         0.166
4       1      5       0         0.166

Is there a simple way to do this?

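Assuming the DataFrame that holds the data is called df, iterate over each row (iterrows returns each row as a Series), compute the distribution (assumed here to be the standard deviation), and attach the results as a new column at the end. For example: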

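# Collect one value per row, then attach the results as a new column.
distribution = list()
for _, row in df.iterrows():  # iterrows() yields (index, row Series) pairs
    distribution.append(row.std())  # standard deviation of the row's values
df['distribution'] = distribution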
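
As a rough sketch, the same per-row standard deviation can also be computed without an explicit loop. The sample data is copied from the question, and the names looped and row_std are only illustrative. Note that this is a per-row statistic, not the frequency of each unique row that the question asks for.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 1],
    'col2': [3, 2, 2, 5, 3, 5],
    'col3': [0, 0, 0, 1, 1, 0],
})

# Loop version, equivalent to the iterrows() answer above.
looped = [row.std() for _, row in df.iterrows()]

# Vectorized version: standard deviation across the columns of each row.
row_std = df.std(axis=1)

print(row_std)
```

Both versions should give the same values; the vectorized call is usually preferable on larger frames because it avoids building a Series object per row.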

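You should be able to use apply and specify the correct axis. In this example I compute the mean of each row, but you can substitute your own distribution function: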

import pandas as pd
import numpy as np

df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['c1','c2','c3'])

df
   c1  c2  c3
0   1   2   3
1   4   5   6
2   7   8   9

df.loc[:, 'row_mean'] = df.apply(np.mean, axis=1)

df
   c1  c2  c3  row_mean
0   1   2   3         2
1   4   5   6         5
2   7   8   9         8
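
To illustrate swapping in a different function, here is a minimal sketch that passes a custom function to apply. The function row_spread is hypothetical and stands in for whatever per-row statistic is actually wanted.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['c1', 'c2', 'c3'])


def row_spread(row):
    """Hypothetical per-row statistic: largest value minus smallest value."""
    return row.max() - row.min()


# axis=1 passes each row to the function as a Series.
df['row_spread'] = df.apply(row_spread, axis=1)
print(df)
```

With axis=1 the function receives one row at a time; with the default axis=0 it would receive one column at a time instead.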

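What is your distribution function?

Sorry if that was unclear. I don't want the mean of each row. I want the frequency of each unique record, obtained by grouping on all columns: take each unique combination of c1, c2, c3 (in my example 1, 2, 0 appears twice) and divide its count by the total number of rows.

You can use groupby with count, create the new column distribution with reset_index, and divide it by its sum: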

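# Count the rows in each unique (col1, col2, col3) combination.
df = df.groupby(['col1','col2','col3'])['col1'].count().reset_index(name='distribution')
# Turn the counts into relative frequencies by dividing by the total count.
df['distribution'] = df['distribution'] / df['distribution'].sum()
print(df)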
   col1  col2  col3  distribution
0     1     2     0      0.333333
1     1     3     0      0.166667
2     1     3     1      0.166667
3     1     5     0      0.166667
4     1     5     1      0.166667
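
A possible variation, sketched here under the assumption that every column should be a grouping key, is to group on list(df.columns) and use size so that no column names are hard-coded. The name dist is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 1],
    'col2': [3, 2, 2, 5, 3, 5],
    'col3': [0, 0, 0, 1, 1, 0],
})

# Group on every column, count the rows in each group,
# then divide by the total number of rows to get relative frequencies.
dist = (
    df.groupby(list(df.columns))
      .size()
      .div(len(df))
      .reset_index(name='distribution')
)
print(dist)
```

On pandas 1.1 or later, df.value_counts(normalize=True) should give the same frequencies, ordered by frequency rather than by the key columns.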