Python: is there a faster way to add a column to a dataset by computing it from other columns?
I have written a piece of code (see below) that performs some calculations on a dataset and adds the result to the dataset as a new column:
ratio_list = []
for s, p, f in zip(A["s"], A["p"], A["f"]):
    # rows in the same (s, p) group with a strictly earlier f
    earlier = A[(A["s"] == s) & (A["p"] == p) & (A["f"] < f)]
    m = earlier[["a", "t"]].product(axis=1).sum()
    n = earlier["a"].sum()
    if n == 0:
        ratio_list.append(0)
    else:
        ratio_list.append(m / n)
A["ratio"] = ratio_list
Group by columns s and p and apply a custom function to each group (the ratio function shown after the sample data):
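Can you add some sample data to the question?
Use df.apply... it is very quick.
Please share a testable dataset and the expected result.
@Shrey I would not say the apply method is fast.
Agreed, I even removed it from my own code recently. But in some use cases apply works really well. Looking at the question here, I assumed df.apply might be the simplest solution for him (minimum 6 seconds).
Sample data with the expected ratio column: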
,s,p,f,a,t,ratio
0,101,2018,2018-01-06,2.0,10.0,13.0
1,101,2018,2018-01-06,2.0,12.0,13.0
2,101,2018,2018-01-03,4.0,14.0,0.0
3,101,2018,2018-01-03,16.0,12.0,0.0
4,101,2018,2018-01-03,12.0,14.0,0.0
5,101,2018,2018-01-06,4.0,10.0,13.0
6,101,2018,2018-01-06,14.0,23.0,13.0
7,101,2018,2018-01-08,4.0,10.0,15.222222222222221
8,101,2018,2018-01-08,20.0,14.0,15.222222222222221
9,101,2018,2018-01-08,21.0,23.0,15.222222222222221
10,101,2018,2018-01-08,21.0,23.0,15.222222222222221
11,101,2018,2018-01-09,4.0,8.0,17.566666666666666
12,101,2018,2018-01-09,10.0,14.0,17.566666666666666
13,101,2018,2018-01-13,13.0,23.0,17.01492537313433
14,101,2018,2018-01-13,9.0,23.0,17.01492537313433
15,103,2018,2018-01-31,20.0,15.0,0.0
16,103,2018,2018-01-31,2.0,15.0,0.0
17,103,2018,2018-01-31,20.0,15.0,0.0
18,103,2018,2018-01-31,20.0,15.0,0.0
19,103,2018,2018-01-31,20.0,15.0,0.0
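To make the sample above testable, here is a minimal sketch that loads it into a DataFrame. It assumes the first unnamed column is the original row index and that f should be parsed as a date; the names csv_text and expected are only illustrative.

```python
import io

import pandas as pd

# Sample data copied from the question; "ratio" is the expected result.
csv_text = """\
,s,p,f,a,t,ratio
0,101,2018,2018-01-06,2.0,10.0,13.0
1,101,2018,2018-01-06,2.0,12.0,13.0
2,101,2018,2018-01-03,4.0,14.0,0.0
3,101,2018,2018-01-03,16.0,12.0,0.0
4,101,2018,2018-01-03,12.0,14.0,0.0
5,101,2018,2018-01-06,4.0,10.0,13.0
6,101,2018,2018-01-06,14.0,23.0,13.0
7,101,2018,2018-01-08,4.0,10.0,15.222222222222221
8,101,2018,2018-01-08,20.0,14.0,15.222222222222221
9,101,2018,2018-01-08,21.0,23.0,15.222222222222221
10,101,2018,2018-01-08,21.0,23.0,15.222222222222221
11,101,2018,2018-01-09,4.0,8.0,17.566666666666666
12,101,2018,2018-01-09,10.0,14.0,17.566666666666666
13,101,2018,2018-01-13,13.0,23.0,17.01492537313433
14,101,2018,2018-01-13,9.0,23.0,17.01492537313433
15,103,2018,2018-01-31,20.0,15.0,0.0
16,103,2018,2018-01-31,2.0,15.0,0.0
17,103,2018,2018-01-31,20.0,15.0,0.0
18,103,2018,2018-01-31,20.0,15.0,0.0
19,103,2018,2018-01-31,20.0,15.0,0.0
"""

# The leading unnamed column becomes the index; f is parsed as datetime.
A = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["f"])

# Keep the expected result aside so a computed column can be checked against it.
expected = A.pop("ratio")
print(A.dtypes)
```

Parsing f as a datetime is a choice made here; the ISO-formatted strings would also compare correctly with `<` as plain text.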
import numpy as np

def ratio(x):
    # 2D mask: ma[i, j] is True when row j has an earlier f than row i
    ma = x['f'].values < x['f'].values[:, None]
    # for pandas 0.24+
    # ma = x['f'].to_numpy() < x['f'].to_numpy()[:, None]
    # keep the a and t values where the mask is True, 0 elsewhere
    a = np.where(ma, x['a'], 0)
    t = np.where(ma, x['t'], 0)
    # multiply and sum along each row of the mask
    m = (a * t).sum(axis=1)
    n = a.sum(axis=1)
    # set the column; 0 where the denominator is 0
    x['ratio1'] = np.where(n == 0, 0, m/n)
    return x

# group_keys=False keeps the original row index on newer pandas versions
A = A.groupby(['s','p'], group_keys=False).apply(ratio)
print (A)
s p f a t ratio ratio1
0 101 2018 2018-01-06 2.0 10.0 13.000000 13.000000
1 101 2018 2018-01-06 2.0 12.0 13.000000 13.000000
2 101 2018 2018-01-03 4.0 14.0 0.000000 0.000000
3 101 2018 2018-01-03 16.0 12.0 0.000000 0.000000
4 101 2018 2018-01-03 12.0 14.0 0.000000 0.000000
5 101 2018 2018-01-06 4.0 10.0 13.000000 13.000000
6 101 2018 2018-01-06 14.0 23.0 13.000000 13.000000
7 101 2018 2018-01-08 4.0 10.0 15.222222 15.222222
8 101 2018 2018-01-08 20.0 14.0 15.222222 15.222222
9 101 2018 2018-01-08 21.0 23.0 15.222222 15.222222
10 101 2018 2018-01-08 21.0 23.0 15.222222 15.222222
11 101 2018 2018-01-09 4.0 8.0 17.566667 17.566667
12 101 2018 2018-01-09 10.0 14.0 17.566667 17.566667
13 101 2018 2018-01-13 13.0 23.0 17.014925 17.014925
14 101 2018 2018-01-13 9.0 23.0 17.014925 17.014925
15 103 2018 2018-01-31 20.0 15.0 0.000000 0.000000
16 103 2018 2018-01-31 2.0 15.0 0.000000 0.000000
17 103 2018 2018-01-31 20.0 15.0 0.000000 0.000000
18 103 2018 2018-01-31 20.0 15.0 0.000000 0.000000
19 103 2018 2018-01-31 20.0 15.0 0.000000 0.000000
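The 2D mask above needs memory proportional to the square of each group's size. As a possible alternative (not part of the original answer), the same "strictly earlier f" sums can be sketched with per-date totals and a cumulative sum. The helper name ratio_cumsum and the column ratio2 are hypothetical, and the sample rows are re-entered here only so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Sample rows from the question as (s, f, a, t); p is 2018 throughout.
rows = [
    (101, "2018-01-06", 2, 10), (101, "2018-01-06", 2, 12), (101, "2018-01-03", 4, 14),
    (101, "2018-01-03", 16, 12), (101, "2018-01-03", 12, 14), (101, "2018-01-06", 4, 10),
    (101, "2018-01-06", 14, 23), (101, "2018-01-08", 4, 10), (101, "2018-01-08", 20, 14),
    (101, "2018-01-08", 21, 23), (101, "2018-01-08", 21, 23), (101, "2018-01-09", 4, 8),
    (101, "2018-01-09", 10, 14), (101, "2018-01-13", 13, 23), (101, "2018-01-13", 9, 23),
    (103, "2018-01-31", 20, 15), (103, "2018-01-31", 2, 15), (103, "2018-01-31", 20, 15),
    (103, "2018-01-31", 20, 15), (103, "2018-01-31", 20, 15),
]
A = pd.DataFrame(rows, columns=["s", "f", "a", "t"]).astype({"a": float, "t": float})
A.insert(1, "p", 2018)
A["f"] = pd.to_datetime(A["f"])


def ratio_cumsum(df):
    """Hypothetical helper: a-weighted mean of t over strictly earlier f per (s, p)."""
    keys = ["s", "p", "f"]
    # Total a*t and total a for each distinct (s, p, f); groupby sorts the keys.
    daily = df.assign(at=df["a"] * df["t"]).groupby(keys, as_index=False)[["at", "a"]].sum()
    grouped = daily.groupby(["s", "p"])
    # Running total minus the current date's own total = sum over strictly earlier dates.
    m = grouped["at"].cumsum() - daily["at"]
    n = grouped["a"].cumsum() - daily["a"]
    # Divide only where n is non-zero; rows with no earlier date get 0.
    daily["ratio2"] = (m / n.where(n != 0)).fillna(0.0)
    # A left merge keeps the row order of df but resets the index to 0..n-1.
    return df.merge(daily[keys + ["ratio2"]], on=keys, how="left")


result = ratio_cumsum(A)
print(result)
```

On this sample, ratio2 should agree with the ratio column above. Because cumulative sums accumulate floating-point error differently from direct sums, comparing with np.allclose rather than exact equality is the safer check on real data.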