Applying a custom function to a melted DataFrame in pandas

I have a melted dataframe that looks like this:
date group metric n_events total_users
0 2017-01-01 control metric1 33.919910 827.416818
27 2017-01-01 variant1 metric1 55.141467 780.840083
54 2017-01-01 variant2 metric1 63.045587 436.381533
1 2017-01-02 control metric2 74.013340 145.551779
28 2017-01-02 variant1 metric2 78.539663 553.410827
I want to compute some uplift metrics on the melted dataframe. So far I've pivoted the dataframe, which isn't ideal:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'group': sorted(['control', 'variant1', 'variant2'] * 27),
     'metric': ['metric1', 'metric2', 'metric3'] * 27,
     'n_events': np.random.uniform(100, 20, size=81),
     'total_users': np.random.uniform(1000, 20, size=81),
     'date': list(pd.date_range('1/1/2017', periods=27, freq='D')) * 3
    })
df = df.sort_values(['date','group','metric'])
t = pd.pivot_table(df, values=['n_events', 'total_users'],
                   index=['date', 'metric'],
                   columns=['group'],
                   aggfunc=np.sum).reset_index()
for var in ['variant1', 'variant2']:
    uplift_colname = var + "_standard_uplift"
    # daily uplift: variant event rate minus control event rate
    t[uplift_colname] = (t['n_events'][var] / t['total_users'][var]) - \
                        (t['n_events']['control'] / t['total_users']['control'])
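The pivot-and-loop approach above can be checked on a tiny deterministic frame (the numbers below are hypothetical, chosen so the rates are easy to verify by hand):

```python
import numpy as np
import pandas as pd

# One date, one metric, three groups; rates are 0.10, 0.30 and 0.05.
df = pd.DataFrame({
    'date': ['2017-01-01'] * 3,
    'metric': ['metric1'] * 3,
    'group': ['control', 'variant1', 'variant2'],
    'n_events': [10.0, 30.0, 5.0],
    'total_users': [100.0, 100.0, 100.0],
})

t = pd.pivot_table(df, values=['n_events', 'total_users'],
                   index=['date', 'metric'],
                   columns=['group'],
                   aggfunc='sum').reset_index()

for var in ['variant1', 'variant2']:
    # variant event rate minus control event rate
    t[var + '_standard_uplift'] = (t['n_events'][var] / t['total_users'][var]) - \
                                  (t['n_events']['control'] / t['total_users']['control'])

# variant1 uplift: 0.30 - 0.10 = 0.20; variant2 uplift: 0.05 - 0.10 = -0.05
```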
I'm looking for a better way to compute the uplift without having to pivot the dataframe, i.e. keeping the melted format. I tried groupby or apply with a custom function, e.g.:

df.groupby(['date', 'metric'])[['n_events', 'group', 'total_users']].apply(myfxn)
This gives the same information as your current t:
group variant1_standard_uplift variant2_standard_uplift
date metric
2017-01-01 metric1 -0.175006 -0.334146
2017-01-02 metric2 0.213414 0.007030
2017-01-03 metric3 0.041405 0.913016
2017-01-04 metric1 -0.102361 -0.044124
2017-01-05 metric2 0.114260 0.031469
2017-01-06 metric3 0.316760 -0.113277
2017-01-07 metric1 3.049462 0.052456
2017-01-08 metric2 -0.050300 -0.015628
2017-01-09 metric3 0.004769 0.239641
2017-01-10 metric1 0.025574 0.153893
2017-01-11 metric2 0.111758 0.083404
2017-01-12 metric3 -0.175687 -0.107851
2017-01-13 metric1 0.147153 0.266303
2017-01-14 metric2 -0.162214 -0.238798
2017-01-15 metric3 0.137627 0.010475
2017-01-16 metric1 -0.223583 -0.208177
2017-01-17 metric2 0.154821 0.189663
2017-01-18 metric3 -0.161725 -0.536955
2017-01-19 metric1 -0.002525 0.027977
2017-01-20 metric2 -0.210697 0.564725
2017-01-21 metric3 -0.228038 -0.255461
2017-01-22 metric1 -0.210647 -0.141039
2017-01-23 metric2 0.354086 -0.366433
2017-01-24 metric3 0.344310 -0.045895
2017-01-25 metric1 0.340080 0.105040
2017-01-26 metric2 2.512369 -0.062200
2017-01-27 metric3 -1.326842 -1.819911
To end up with the same dataframe as df, but with two new columns appended:
def proc(df):
    s = df.groupby('group').sum()
    r = s.n_events / s.total_users
    return r.drop('control').sub(r.loc['control'])

gcols = ['date', 'metric']
ocols = ['group', 'n_events', 'total_users']
suffix = '_standard_uplift'

df.join(df.groupby(gcols)[ocols].apply(proc).add_suffix(suffix), on=gcols).sort_index()
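On a small deterministic frame (hypothetical numbers, not from the question), the groupby/apply step alone behaves like this:

```python
import pandas as pd

# One date, one metric, three groups; rates are 0.10, 0.30 and 0.05.
df = pd.DataFrame({
    'date': ['2017-01-01'] * 3,
    'metric': ['metric1'] * 3,
    'group': ['control', 'variant1', 'variant2'],
    'n_events': [10.0, 30.0, 5.0],
    'total_users': [100.0, 100.0, 100.0],
})

def proc(g):
    s = g.groupby('group').sum()
    r = s.n_events / s.total_users          # per-group event rate
    return r.drop('control').sub(r.loc['control'])

# proc returns a Series indexed by the variant names, so apply produces a
# DataFrame with one column per variant, indexed by (date, metric).
uplift = df.groupby(['date', 'metric'])[['group', 'n_events', 'total_users']].apply(proc)

# variant1 uplift: 0.30 - 0.10 = 0.20; variant2 uplift: 0.05 - 0.10 = -0.05
```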
Can you provide an example of the expected result?

Thanks. What does r.drop('control').sub(r.loc['control']) do?

r will be a pd.Series with three index entries: ['control', 'variant1', 'variant2']. r.drop('control') removes the entry associated with the index 'control', keeping the other two, and .sub(r.loc['control']) then subtracts the value associated with 'control' from each remaining entry.
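A minimal illustration of that comment, using a hand-made Series of per-group rates (hypothetical values):

```python
import pandas as pd

# r plays the role of the per-group event rates computed inside proc.
r = pd.Series({'control': 0.10, 'variant1': 0.30, 'variant2': 0.05})

# Drop the control entry, then subtract the control rate from the rest.
uplift = r.drop('control').sub(r.loc['control'])
# uplift holds variant1: 0.30 - 0.10 = 0.20, variant2: 0.05 - 0.10 = -0.05
```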