Python: Adding a percentage column to a DataFrame

python pandas


I have a df like the following:

User    Purchase_Count    Location_Count
1       2                 3
2       10                5
3       5                 1
4       20                4
5       2                 3
6       2                 3
7       10                5

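How can I add a column that gives, for each row, the percentage of all entries that share its coordinate pair (Purchase_Count[i], Location_Count[i])? For example, I would like the df to look like: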

User    Purchase_Count    Location_Count    %
1       2                 3                 42.85
2       10                5                 28.57
3       5                 1                 14.28
4       20                4                 14.28
5       2                 3                 42.85
6       2                 3                 42.85
7       10                5                 28.57


The solution is to use groupby and then transform:

In [43]: df
Out[43]:
   User  Purchase_Count  Location_Count
0     1               2               3
1     2              10               5
2     3               5               1
3     4              20               4
4     5               2               3
5     6               2               3
6     7              10               5

In [44]: total = len(df)

In [45]: df['percentage'] = df.groupby(['Purchase_Count', 'Location_Count'])['User'].transform(lambda r: r.count()/total)

In [46]: df
Out[46]:
   User  Purchase_Count  Location_Count  percentage
0     1               2               3    0.428571
1     2              10               5    0.285714
2     3               5               1    0.142857
3     4              20               4    0.142857
4     5               2               3    0.428571
5     6               2               3    0.428571
6     7              10               5    0.285714

Edit: following @piRSquared's suggestion, you can use this instead:

df.groupby(['Purchase_Count', 'Location_Count']).transform('count') / total

Preliminary testing suggests it is much faster.
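For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt from the sample data in the question. Selecting the User column before transform, scaling to 0-100, and rounding to two decimals are my own assumptions about the desired output, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "User": [1, 2, 3, 4, 5, 6, 7],
    "Purchase_Count": [2, 10, 5, 20, 2, 2, 10],
    "Location_Count": [3, 5, 1, 4, 3, 3, 5],
})

cols = ["Purchase_Count", "Location_Count"]

# Selecting one column first makes transform return a Series aligned
# with df's index; 'count' gives the number of rows in each group.
group_counts = df.groupby(cols)["User"].transform("count")

# Assumption: express the share as a 0-100 percentage with two decimals.
df["percentage"] = (group_counts / len(df) * 100).round(2)

print(df)
```

With rounding, the shares come out as 42.86, 28.57 and 14.29; the question's 42.85 and 14.28 look like truncated versions of the same numbers.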

Use groupby and size, together with join:

cols = ['Purchase_Count', 'Location_Count']
df.join(df.groupby(cols).size().div(len(df)).rename('%'), on=cols)

   User  Purchase_Count  Location_Count         %
0     1               2               3  0.428571
1     2              10               5  0.285714
2     3               5               1  0.142857
3     4              20               4  0.142857
4     5               2               3  0.428571
5     6               2               3  0.428571
6     7              10               5  0.285714
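To see why the join lines up, it may help to look at the intermediate result. The sketch below splits the one-liner into two steps on the question's sample data; the names shares and result are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "User": [1, 2, 3, 4, 5, 6, 7],
    "Purchase_Count": [2, 10, 5, 20, 2, 2, 10],
    "Location_Count": [3, 5, 1, 4, 3, 3, 5],
})

cols = ["Purchase_Count", "Location_Count"]

# One row per distinct pair: the group size divided by the total row
# count, indexed by (Purchase_Count, Location_Count).
shares = df.groupby(cols).size().div(len(df)).rename("%")
print(shares)

# join matches the `cols` columns of df against the index levels of
# `shares`, so every row picks up the share of its own pair.
result = df.join(shares, on=cols)
print(result)
```

Note that join returns a new DataFrame and leaves df unchanged, so the result has to be assigned if you want to keep it.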

Old answer

Use pd.value_counts on tuples
tups = df[['Purchase_Count', 'Location_Count']].apply(tuple, 1)
df.assign(**{'%': tups.map(pd.value_counts(tups, normalize=True))})

   User  Purchase_Count  Location_Count         %
0     1               2               3  0.428571
1     2              10               5  0.285714
2     3               5               1  0.142857
3     4              20               4  0.142857
4     5               2               3  0.428571
5     6               2               3  0.428571
6     7              10               5  0.285714


Timing
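One way to compare the groupby-based approaches yourself is a small timeit harness like the sketch below. The enlarged frame big, the function names, and the repeat counts are all assumptions made for illustration. No figures are quoted here because results depend on the pandas version and the machine.

```python
import timeit

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "User": [1, 2, 3, 4, 5, 6, 7],
    "Purchase_Count": [2, 10, 5, 20, 2, 2, 10],
    "Location_Count": [3, 5, 1, 4, 3, 3, 5],
})

# Assumption: repeat the 7 sample rows to get a frame large enough
# for the timings to mean something.
big = pd.concat([df] * 1000, ignore_index=True)
cols = ["Purchase_Count", "Location_Count"]
total = len(big)


def with_lambda():
    # transform with a Python lambda, as in the first answer
    return big.groupby(cols)["User"].transform(lambda r: r.count() / total)


def with_count():
    # transform with the built-in 'count', dividing once afterwards
    return big.groupby(cols)["User"].transform("count") / total


def with_join():
    # groupby + size + join, as in the second answer
    shares = big.groupby(cols).size().div(total).rename("%")
    return big.join(shares, on=cols)["%"]


candidates = {"lambda": with_lambda, "count": with_count, "join": with_join}
runs = 20

# Best of three repeats, each repeat calling the function `runs` times.
timings = {
    name: min(timeit.repeat(fn, number=runs, repeat=3))
    for name, fn in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds / runs * 1e3:.3f} ms per call")
```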
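Since total is a constant, you should be able to get away with df.groupby(['Purchase_Count', 'Location_Count']).transform('count') / total. Either way, +1.
I would suggest accepting @juanpa.arrivillaga's answer. It is faster and arguably more idiomatic.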