Can apply functions or vectorization be used for this code logic in Python?
Tags: python, pandas, numpy, vectorization, apply
I am calculating a closing balance. Input dataframe:
open inOut close
0 3 100 0
1 0 300 0
2 0 200 0
3 0 230 0
4 0 150 0
Output dataframe:
open inOut close
0 3 100 103
1 103 300 403
2 403 200 603
3 603 230 833
4 833 150 983
I can achieve this with a rough for loop, and to optimize it I used iterrows().
For loop:
%%timeit
for i in range(len(df.index)):
    if i > 0:
        df.iloc[i]['open'] = df.iloc[i-1]['close']
        df.iloc[i]['close'] = df.iloc[i]['open'] + df.iloc[i]['inOut']
    else:
        df.iloc[i]['close'] = df.iloc[i]['open'] + df.iloc[i]['inOut']
1.64 ms ± 51.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using iterrows():
%%timeit
for index, row in dfOg.iterrows():
    if index > 0:
        row['open'] = dfOg.iloc[index-1]['close']
        row['close'] = row['open'] + row['inOut']
    else:
        row['close'] = row['open'] + row['inOut']
627 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
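Performance improved from 1.64 ms to 627 µs.
So I am trying to figure out how to write the above logic using apply() and vectorization. For vectorization, I tried shifting the columns but could not get the desired output.

Edit: I changed things around to match the edits the OP made to the question.
You can do what you want in a vectorized way, without any loops, like this: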
import pandas as pd
d = {'open': [3] + [0]*4, 'inOut': [100, 300, 200, 230, 150], 'close': [0]*5}
df = pd.DataFrame(d)
df['close'].values[:] = df['open'].values[0] + df['inOut'].values.cumsum()
df['open'].values[1:] = df['close'].values[:-1]
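Why this works: since each row's open is the previous row's close, and close = open + inOut, the close column is just the first opening balance plus the running sum of inOut. The snippet above writes through `.values`; depending on the pandas version (for example with Copy-on-Write enabled) such in-place writes may not reach the dataframe or may be rejected. A minimal sketch of the same idea using plain column assignment follows; the function name `fill_balances` is hypothetical and not part of the original answer.

```python
import pandas as pd

def fill_balances(df):
    """Return a copy of df with 'open' and 'close' filled in (hypothetical helper)."""
    out = df.copy()
    opening = out['open'].iloc[0]                    # first opening balance
    out['close'] = opening + out['inOut'].cumsum()   # running closing balance
    # Each row opens at the previous row's close; the first row keeps its own open.
    out['open'] = out['close'].shift(fill_value=opening)
    return out

df = pd.DataFrame({'open': [3, 0, 0, 0, 0],
                   'inOut': [100, 300, 200, 230, 150],
                   'close': [0] * 5})
result = fill_balances(df)
print(result)
```

The timings quoted in this answer are for the original `.values` snippet, not for this variant.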
Timing this with %%timeit:
529 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Output:
close inOut open
0 103 100 3
1 403 300 103
2 603 200 403
3 833 230 603
4 983 150 833
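So vectorizing the code this way is indeed somewhat faster. In fact, it is probably about as fast as it can get. You can see this by timing the dataframe creation code on its own: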
%%timeit
d = {'open': [3] + [0]*4, 'inOut': [100, 300, 200, 230, 150], 'close': [0]*5}
df = pd.DataFrame(d)
Result:
367 µs ± 5.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Subtracting the time needed to create the dataframe, the vectorized version takes only about 160 µs to fill it in.

You can use np.where:
%%timeit
df['open'] = np.where(df.index==0, df['open'], df['inOut'].shift())
df['close'] = df['open'] + df['inOut']
# 1.07 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Output:
open inOut close
0 3.0 100 103.0
1 100.0 300 300.0
2 300.0 200 200.0
3 200.0 230 230.0
4 230.0 150 150.0
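Note that shifting inOut sets each open to the previous row's inOut, not to the previous row's close, so the balance is not carried forward. One way to keep np.where and still get running balances is to compute close with a cumulative sum first and shift that instead. This is a sketch under that assumption, not the original answer's code; the variable names `opening` and `close` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'open': [3, 0, 0, 0, 0],
                   'inOut': [100, 300, 200, 230, 150],
                   'close': [0] * 5})

opening = df['open'].iloc[0]               # first opening balance
close = opening + df['inOut'].cumsum()     # running closing balance
# First row keeps its opening balance; later rows open at the previous close.
# shift() introduces a NaN in row 0, so 'open' comes out as floats.
df['open'] = np.where(df.index == 0, opening, close.shift())
df['close'] = close
print(df)
```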
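Comments:
Sorry, I made a silly mistake in the closing balance logic.
apply is not vectorization.
@juanpa.arrivillaga Yes, I agree, but according to the blog I referred to, apply is faster than iterrows().
You should use itertuples; apply won't be faster than that. Note that your iterrows version doesn't work: it does not modify the original dataframe.
Thanks, @juanpa.arrivillaga, I will check the performance of itertuples as well.
This works, but it's slow. I'm guessing that comes from the array construction in np.where?
@tel Yes, it is a bit slower than your answer because of the conditional check in np.where.
Please reconsider the question. By the way, I like this simple approach, but I doubt that it works for calculating the closing balance.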
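For reference, the itertuples suggestion from the comments could look roughly like the sketch below. It collects results in lists and assigns them back as whole columns, since (as noted above) writing to the row objects yielded during iteration does not change the original dataframe. The function name `fill_balances_itertuples` is hypothetical, and no timing is claimed for it.

```python
import pandas as pd

def fill_balances_itertuples(df):
    """Loop-based fill using itertuples (hypothetical helper, not from the thread)."""
    opens, closes = [], []
    prev_close = None
    for row in df.itertuples(index=False):
        # First row uses its own opening balance; later rows use the previous close.
        open_ = row.open if prev_close is None else prev_close
        close = open_ + row.inOut
        opens.append(open_)
        closes.append(close)
        prev_close = close
    # Assign whole columns at once; returns a new dataframe.
    return df.assign(open=opens, close=closes)

df = pd.DataFrame({'open': [3, 0, 0, 0, 0],
                   'inOut': [100, 300, 200, 230, 150],
                   'close': [0] * 5})
result = fill_balances_itertuples(df)
print(result)
```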