Python: apply a multi-condition groupby + sort + sum to dataframe rows
I have a dataframe with the following columns: Acct, ContactDate, OpenDate. For each account that was opened, I have been asked to look back at all correspondence that occurred within 30 days of that account's open date, and then assign points to each correspondence as follows:
Forty-twenty-forty: Attribute 40% (0.4 points) of the attribution to the first touch,
40% to the last touch, and divide the remaining 20% between all touches in between
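To make the rule concrete, here is a minimal sketch of the weights it produces for a given number of touches. The function name `forty_twenty_forty` is made up for illustration, and since the rule as stated only covers three or more touches, the sketch refuses anything smaller rather than guessing.

```python
def forty_twenty_forty(n):
    """Return the points for n touches, ordered from first to last."""
    if n < 3:
        # The rule as quoted does not say what to do with one or two touches.
        raise ValueError("rule is only defined here for n >= 3")
    middle = 0.2 / (n - 2)  # the in-between touches split 20% evenly
    return [0.4] + [middle] * (n - 2) + [0.4]

print(forty_twenty_forty(3))  # [0.4, 0.2, 0.4]
print(forty_twenty_forty(4))  # [0.4, 0.1, 0.1, 0.4]
```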
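I know about apply and groupby, but this is above my pay grade.
I have to group by account, conditionally on a comparison of two columns against each other.
I need that to get the total number of correspondences, and I'm guessing they also have to be sorted, because the next step of assigning points depends on the order in which they occurred.
I'd like to do this efficiently since I have a large number of rows. I know apply() can run fast, but I struggle to use it once the row-level operation gets even a little complicated.
I'd appreciate any help, as I'm not great with pandas.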
Edit
As requested:
Acct, ContactDate, OpenDate, Points (what I need to calculate)
123, 1/1/2018, 1/1/2021, 0 (because correspondence not within 30 days of open)
123, 12/10/2020, 1/1/2021, 0.4 (first touch gets 0.4)
123, 12/11/2020, 1/1/2021, 0.2 (other 'touches' get 0.2/(num of touches-2) 'points')
123, 12/12/2020, 1/1/2021, 0.4 (last touch gets 0.4)
456, 1/1/2018, 1/1/2021, 0 (again, because correspondence not within 30 days of open)
456, 12/10/2020, 1/1/2021, 0.4 (first touch gets 0.4)
456, 12/11/2020, 1/1/2021, 0.1 (other 'touches' get 0.2/(num of touches-2) 'points')
456, 12/11/2020, 1/1/2021, 0.1 (other 'touches' get 0.2/(num of touches-2) 'points')
456, 12/12/2020, 1/1/2021, 0.4 (last touch gets 0.4)
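For anyone who wants to reproduce the example, the sample rows above can be loaded roughly like this. It is only a sketch: the dates are read as month/day/year, and the Points column is left out because it is the value to be computed.

```python
import pandas as pd

# Sample rows from the question: (Acct, ContactDate, OpenDate)
rows = [
    (123, "1/1/2018", "1/1/2021"),
    (123, "12/10/2020", "1/1/2021"),
    (123, "12/11/2020", "1/1/2021"),
    (123, "12/12/2020", "1/1/2021"),
    (456, "1/1/2018", "1/1/2021"),
    (456, "12/10/2020", "1/1/2021"),
    (456, "12/11/2020", "1/1/2021"),
    (456, "12/11/2020", "1/1/2021"),
    (456, "12/12/2020", "1/1/2021"),
]
df = pd.DataFrame(rows, columns=["Acct", "ContactDate", "OpenDate"])

# Convert the text dates to real datetimes so they can be subtracted.
for col in ("ContactDate", "OpenDate"):
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y")

print(df.dtypes)
```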
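This returns a reduced dataframe, because it drops the rows that fall outside the 30-day window; the original df is then merged back onto it to get all the data in one df. It assumes your dates are already in the right order. If they are not, you may need to sort before applying the function below.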
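If the rows are not already in order, a sort along these lines could be run first. This is a sketch on hypothetical, deliberately shuffled rows; it relies on `sort_values` keeping the original index labels, which is what an index-based merge back onto the original frame needs.

```python
import pandas as pd

# Hypothetical, deliberately unsorted contacts for a single account.
df = pd.DataFrame({
    "Acct": [123, 123, 123],
    "ContactDate": pd.to_datetime(["2020-12-12", "2020-12-10", "2020-12-11"]),
    "OpenDate": pd.to_datetime(["2021-01-01"] * 3),
})

# Order each account's contacts chronologically; index labels travel with the rows.
df_sorted = df.sort_values(["Acct", "ContactDate"])
print(df_sorted.index.tolist())  # [1, 2, 0]
```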
from datetime import timedelta

# assumes ContactDate and OpenDate are already datetime columns (e.g. via pd.to_datetime)
df['Points'] = 0.0  # add the column before the analysis (float, so fractional points fit)
# df.columns
# Index(['Acct', 'ContactDate', 'OpenDate', 'Points'], dtype='object')

def points(x):
    # keep only contacts made no more than 30 days before the open date
    newx = x.loc[(x['OpenDate'] - x['ContactDate']) <= timedelta(days=30)].copy()
    n = len(newx)  # number of touches inside the window
    pts = newx.columns.get_loc('Points')  # column position, to avoid chained assignment
    if n > 2:  # more than two touches exist
        newx.iloc[0, pts] = .4  # first row
        newx.iloc[-1, pts] = .4  # last row
        newx.iloc[1:-1, pts] = .2 / (n - 2)  # middle rows share 0.2 equally
    elif n == 2:  # placeholder for later
        pass  # edge case logic here for two occurrences
    elif n == 1:  # placeholder for later
        pass  # edge case logic here for one occurrence
    return newx

# groupby Acct, then clean up the indices so it can be merged back into the original df
dft = df.groupby('Acct', as_index=False).apply(points).reset_index().set_index('level_1').drop('level_0', axis=1)

# merge on index; rows outside the window have no match and get 0
df_points = df[['Acct', 'ContactDate', 'OpenDate']].merge(dft['Points'], how='left', left_index=True, right_index=True).fillna(0)
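Comments:
It's not clear what assigning points means. Can you show an example of what you want the output to look like?
Yes. I'll post an edit.
What happens if the first or last touch is the same as the other touches? E.g. what if an account has 3 correspondences and all of them occur on the same day?
If that isn't an issue, then Jonathan's answer below seems to solve it!
Thank you so much. I never would have come up with the reset_index/set_index approach.
You're welcome. Cheers.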
   Acct ContactDate   OpenDate  Points
0   123  2018-01-01 2021-01-01     0.0
1   123  2020-12-10 2021-01-01     0.4
2   123  2020-12-11 2021-01-01     0.2
3   123  2020-12-12 2021-01-01     0.4
4   456  2018-01-01 2021-01-01     0.0
5   456  2020-12-10 2021-01-01     0.4
6   456  2020-12-11 2021-01-01     0.1
7   456  2020-12-11 2021-01-01     0.1
8   456  2020-12-12 2021-01-01     0.4
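Since the question mentions a large number of rows, the same result can probably be reached without a Python-level function per group. The following is a sketch of one vectorized alternative, not the method from the answer: it uses the same window test, gives 0 to accounts with one or two touches (mirroring the unfilled placeholders), and the names `in_window`, `win`, `pos`, `n`, `edge` and `middle` are made up here.

```python
import pandas as pd

df = pd.DataFrame({
    "Acct": [123] * 4 + [456] * 5,
    "ContactDate": pd.to_datetime([
        "2018-01-01", "2020-12-10", "2020-12-11", "2020-12-12",
        "2018-01-01", "2020-12-10", "2020-12-11", "2020-12-11", "2020-12-12",
    ]),
    "OpenDate": pd.to_datetime(["2021-01-01"] * 9),
})

# Same window test as the answer: contact no more than 30 days before opening.
# (Contacts dated after the open date also pass, exactly as they do in the answer.)
in_window = (df["OpenDate"] - df["ContactDate"]) <= pd.Timedelta(days=30)
win = df[in_window].sort_values(["Acct", "ContactDate"])

g = win.groupby("Acct")
pos = g.cumcount()                      # 0-based position within each account
n = g["ContactDate"].transform("size")  # touches per account inside the window

pts = pd.Series(0.0, index=win.index)
edge = (n > 2) & ((pos == 0) | (pos == n - 1))  # first and last touch
middle = (n > 2) & ~edge                        # everything in between
pts.loc[edge] = 0.4
pts.loc[middle] = 0.2 / (n[middle] - 2)

# Rows outside the window are absent from pts and therefore get 0.
df["Points"] = pts.reindex(df.index, fill_value=0.0)
print(df["Points"].round(2).tolist())
# [0.0, 0.4, 0.2, 0.4, 0.0, 0.4, 0.1, 0.1, 0.4]
```

When several touches share a date, which of them counts as first or last depends on row order after the sort, so a tie-breaking column would be needed if that matters.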