Python: change a value only after a cumulative change
data = [0.1, 0.2, 0.3, 0.4 , 0.5, 0.6, 0.7, 0.8, 0.5, 0.2, 0.1, -0.1,
-0.2, -0.3, -0.4, -0.5, -0.6, -0.7, -0.9, -1.2, -0.1, -0.7]
Every time a data point changes by more than the step, I want to record it. Otherwise I want to keep the old value until the cumulative change is at least as large as the step.
I implemented this iteratively like this:
import pandas as pd
from copy import deepcopy
import numpy as np

step = 0.5
df_steps = pd.Series(data)
df = df_steps.copy()
today = None
yesterday = None

for index, value in df_steps.items():  # iteritems() was removed in pandas 2.0
    today = deepcopy(index)
    if today is not None and yesterday is not None:
        if abs(df.loc[today] - df_steps.loc[yesterday]) > step:
            df_steps.loc[today] = df.loc[today]
        else:
            df_steps.loc[today] = df_steps.loc[yesterday]
    yesterday = deepcopy(today)
My final result is:
[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.7, 0.7, 0.7, 0.7, 0.1, 0.1, 0.1, 0.1, 0.1, -0.5, -0.5, -0.5, -0.5, -1.2, -0.1, -0.7]
Problem and question
The problem is that this is implemented iteratively (and I agree with the second answer). My question is how to achieve the same result in a vectorized way.
Attempt
My attempt is below, but it does not match the result:
(df.diff().cumsum().replace(np.nan, 0) / step).astype(int)
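A quick check shows why this attempt cannot work: `diff()` followed by `cumsum()` telescopes, so it always measures each point against the very first element, whereas the rule above resets its reference point each time the step is exceeded. A short sketch of this (variable names are mine, not from the question):

```python
import numpy as np
import pandas as pd

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.5, 0.2, 0.1, -0.1,
        -0.2, -0.3, -0.4, -0.5, -0.6, -0.7, -0.9, -1.2, -0.1, -0.7]
df = pd.Series(data)

# diff() then cumsum() telescopes: intermediate terms cancel, leaving
# simply data[i] - data[0] for every i.
telescoped = df.diff().cumsum().replace(np.nan, 0)
assert np.allclose(telescoped, df - df.iloc[0])
```

Because the reference point must update whenever the threshold is crossed, the output at each position depends on the previous output, which is why a purely elementwise/cumulative pandas expression like this cannot reproduce the result.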
It's not vectorized, but this solution avoids deepcopy() and all the .loc calls, so it should be faster:
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.5, 0.2, 0.1, -0.1,
        -0.2, -0.3, -0.4, -0.5, -0.6, -0.7, -0.9, -1.2, -0.1, -0.7]

def fn(step):
    current = float('inf')
    i = yield
    while True:
        if abs(current - i) > step:
            current = i
            i = yield i
        else:
            i = yield current

df = pd.DataFrame({'data': data})
f = fn(0.5)
next(f)
df['new_data'] = df['data'].apply(lambda x: f.send(x))
print(df)
Prints:
data new_data
0 0.1 0.1
1 0.2 0.1
2 0.3 0.1
3 0.4 0.1
4 0.5 0.1
5 0.6 0.1
6 0.7 0.7
7 0.8 0.7
8 0.5 0.7
9 0.2 0.7
10 0.1 0.1
11 -0.1 0.1
12 -0.2 0.1
13 -0.3 0.1
14 -0.4 0.1
15 -0.5 -0.5
16 -0.6 -0.5
17 -0.7 -0.5
18 -0.9 -0.5
19 -1.2 -1.2
20 -0.1 -0.1
21 -0.7 -0.7
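Since the carried value depends only on the previous output and the current input, the same recurrence can also be expressed with itertools.accumulate, without a generator object to prime. A sketch (the lambda inlines the threshold logic from fn):

```python
from itertools import accumulate

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.5, 0.2, 0.1, -0.1,
        -0.2, -0.3, -0.4, -0.5, -0.6, -0.7, -0.9, -1.2, -0.1, -0.7]
step = 0.5

# Carry the previous output forward unless the current point has moved
# more than `step` away from it.
new_data = list(accumulate(data,
                           lambda prev, cur: cur if abs(cur - prev) > step else prev))
print(new_data[:7])  # → [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.7]
```

This reproduces the expected result above; it is still a Python-level loop internally, so it mainly buys readability rather than speed.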
Since a purely vectorized approach does not look straightforward, we can use numba to compile the code down to C level, giving a loopy but very performant approach. Here is one way using numba's nopython mode:
import numpy as np
from numba import njit

@njit('float64[:](float64[:], float32)')
def set_at_cum_change(a, step):
    out = np.empty(len(a), dtype=np.float64)
    prev = a[0]
    out[0] = a[0]
    for i in range(1, len(a)):
        current = a[i]
        if np.abs(current - prev) > step:
            out[i] = current
            prev = current
        else:
            out[i] = out[i-1]
    return out
Testing on the same array gives:
data = np.array([0.1, 0.2, 0.3, 0.4 , 0.5, 0.6, 0.7, 0.8, 0.5, 0.2, 0.1, -0.1,
-0.2, -0.3, -0.4, -0.5, -0.6, -0.7, -0.9, -1.2, -0.1, -0.7])
out = set_at_cum_change(data, step=0.5)
print(out)
array([ 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.7, 0.7, 0.7, 0.7, 0.1,
0.1, 0.1, 0.1, 0.1, -0.5, -0.5, -0.5, -0.5, -1.2, -0.1, -0.7])
If we check timings, we see a huge ~110000x speedup with the numba approach on a 22000-length array. This not only shows that numba is a great fit for these cases, but also that np.empty is the better choice here, since it has a smaller memory footprint than a zero-initialized array. Note that the loop starts from the second element because out is initialized with the first value of the array.
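For a 2D array, one way to adapt the answer is to allocate out with the matrix's shape and run the same recurrence once per column. The function below is my own sketch, not from the answer, shown in plain NumPy; decorating it with numba's @njit should work the same way and is where the speedup would come from:

```python
import numpy as np

def set_at_cum_change_2d(a, step):
    # Same recurrence as set_at_cum_change, applied independently
    # to each column of a 2D array.
    out = np.empty(a.shape, dtype=np.float64)
    out[0, :] = a[0, :]
    for j in range(a.shape[1]):            # loop over columns
        prev = a[0, j]
        for i in range(1, a.shape[0]):     # scalar recurrence down each column
            if abs(a[i, j] - prev) > step:
                out[i, j] = a[i, j]
                prev = a[i, j]
            else:
                out[i, j] = out[i - 1, j]
    return out

# Stacking the example data into two identical columns should give
# two identical columns of the 1D result.
data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.5, 0.2, 0.1, -0.1,
                 -0.2, -0.3, -0.4, -0.5, -0.6, -0.7, -0.9, -1.2, -0.1, -0.7])
res = set_at_cum_change_2d(np.stack([data, data], axis=1), 0.5)
```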
def op(data):
    step = 0.5
    df_steps = pd.Series(data)
    df = df_steps.copy()
    today = None
    yesterday = None
    for index, value in df_steps.items():
        today = deepcopy(index)
        if today is not None and yesterday is not None:
            if abs(df.loc[today] - df_steps.loc[yesterday]) > step:
                df_steps.loc[today] = df.loc[today]
            else:
                df_steps.loc[today] = df_steps.loc[yesterday]
        yesterday = deepcopy(today)
    return df_steps.to_numpy()

def fn(step):
    current = float('inf')
    i = yield
    while True:
        if abs(current - i) > step:
            current = i
            i = yield i
        else:
            i = yield current

def andrej(data):
    df = pd.DataFrame({'data': data})
    f = fn(0.5)
    next(f)
    df['new_data'] = df['data'].apply(lambda x: f.send(x))
data_large = np.tile(data, 1_000)
print(data_large.shape)
# (22000,)
np.allclose(op(data_large), set_at_cum_change(data_large, step=0.5))
# True
%timeit op(data_large)
# 5.78 s ± 329 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit andrej(data_large)
# 13.6 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit set_at_cum_change(data_large, step=0.5)
# 50.4 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)