Python: vectorising a loop based on the order of values in a series

python pandas performance numpy dataframe

This question is based on an earlier question that I answered.

The input looks like this:

Index   Results  Price
0       Buy      10
1       Sell     11
2       Buy      12
3       Neutral  13
4       Buy      14
5       Sell     15

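I need to find every Buy-Sell sequence (ignoring extra Buy / Sell values that fall outside a sequence) and compute the difference in Price.

Desired output: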

Index Results Price Difference
0     Buy     10    
1     Sell    11    1
2     Buy     12    
3     Neutral 13    
4     Buy     14    
5     Sell    15    3

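My solution is verbose, but it seems to work:

import numpy as np
from numba import njit

@njit
def get_diffs(results, prices):
    # results: 0 for Buy, 1 for Sell, NaN for anything else
    res = np.full(prices.shape, np.nan)
    # prev_one: waiting for a Buy; prev_zero: a Buy is open, waiting for a Sell
    prev_one, prev_zero = True, False
    for i in range(len(results)):
        if prev_one and (results[i] == 0):
            # first Buy of a new sequence: remember its price
            price_start = prices[i]
            prev_zero, prev_one = True, False
        elif prev_zero and (results[i] == 1):
            # Sell that closes the open sequence: record the difference
            res[i] = prices[i] - price_start
            prev_zero, prev_one = False, True
    return res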

results = df['Results'].map({'Buy': 0, 'Sell': 1})

df['Difference'] = get_diffs(results.values, df['Price'].values)
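For checking results on a machine without numba, the same state machine can be written in plain Python. This is only a sketch for reference: `get_diffs_py` is a hypothetical name, and it takes the raw labels rather than the 0 / 1 codes used above.

```python
import math

def get_diffs_py(results, prices):
    """Plain-Python version of the Buy -> Sell state machine (no numba)."""
    diffs = [math.nan] * len(prices)
    price_start = None  # price of the Buy that opened the current sequence
    for i, (label, price) in enumerate(zip(results, prices)):
        if price_start is None and label == "Buy":
            # first Buy after the previous Sell (or at the start)
            price_start = price
        elif price_start is not None and label == "Sell":
            # Sell closing the open sequence
            diffs[i] = price - price_start
            price_start = None
    return diffs

labels = ["Buy", "Sell", "Buy", "Neutral", "Buy", "Sell"]
prices = [10, 11, 12, 13, 14, 15]
diffs = get_diffs_py(labels, prices)
print(diffs)  # [nan, 1, nan, nan, nan, 3]
```

It is far slower than the jitted loop, so it is useful for small test cases rather than for the benchmark.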

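Is there a vectorised way to do this? I care about both code maintainability and performance over a large number of rows.

Edit: benchmarking code: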

import pandas as pd

df = pd.DataFrame.from_dict({'Index': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5},
                             'Results': {0: 'Buy', 1: 'Sell', 2: 'Buy', 3: 'Neutral', 4: 'Buy', 5: 'Sell'},
                             'Price': {0: 10, 1: 11, 2: 12, 3: 13, 4: 14, 5: 15}})

df = pd.concat([df]*10**4, ignore_index=True)

def jpp(df):
    results = df['Results'].map({'Buy': 0, 'Sell': 1})    
    return get_diffs(results.values, df['Price'].values)

%timeit jpp(df)  # 7.99 ms ± 142 µs per loop

Use cumcount to find the pairs:

s=df.groupby('Results').cumcount()
df['Diff']=df.Price.groupby(s).diff().loc[df.Results.isin(['Buy','Sell'])]
df
Out[596]: 
   Index  Results  Price  Diff
0      0      Buy     10   NaN
1      1     Sell     11   1.0
2      2      Buy     12   NaN
3      3  Neutral     13   NaN
4      4      Buy     14   NaN
5      5     Sell     15   3.0
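To see why this works on the sample, it helps to print the intermediate key. The sketch below rebuilds the sample frame (without the Index column, for brevity) and shows which prices end up in each cumcount group.

```python
import pandas as pd

df = pd.DataFrame({
    "Results": ["Buy", "Sell", "Buy", "Neutral", "Buy", "Sell"],
    "Price": [10, 11, 12, 13, 14, 15],
})

# cumcount numbers each row by how many earlier rows carry the same label
s = df.groupby("Results").cumcount()
print(s.tolist())  # [0, 0, 1, 0, 2, 1]

# rows that share a number form one group; diff() then runs inside each group
groups = {key: prices.tolist() for key, prices in df["Price"].groupby(s)}
print(groups)  # {0: [10, 11, 13], 1: [12, 15], 2: [14]}
```

Note that this pairs the n-th Buy with the n-th Sell. On the sample that coincides with the loop's result, but an input such as Buy, Buy, Sell, Buy, Sell would be paired differently, so it seems worth checking against the loop on your own data.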

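I'll write up some alternatives using scipy and numpy later, but here is a clear, straightforward answer that offers a vectorised alternative, even though it still lags behind numba in performance.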

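If I understand the question correctly, there will be a "Buy", followed by any number of other values, and then eventually a "Sell", and you want the difference between that first "Buy" and the "Sell". Then another "Buy" starts the next sequence, and so on.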

You can use cumsum and shift to create the series to group by:

a = df.Results.shift().eq('Sell').cumsum()
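As a quick illustration on the sample data (rebuilt here so the snippet runs on its own), the grouping key increases by one on the row after each Sell:

```python
import pandas as pd

df = pd.DataFrame({
    "Results": ["Buy", "Sell", "Buy", "Neutral", "Buy", "Sell"],
    "Price": [10, 11, 12, 13, 14, 15],
})

# True on every row whose previous row was a Sell
starts_new_group = df.Results.shift().eq("Sell")
print(starts_new_group.tolist())  # [False, False, True, False, False, False]

# the running total turns those flags into group labels
a = starts_new_group.cumsum()
print(a.tolist())  # [0, 0, 1, 1, 1, 1]
```

Rows 0 to 1 form the first group and rows 2 to 5 the second, so each group ends on a Sell.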

Next, you can use agg to find the first and last value of each group:

agr = df.groupby(a).Price.agg(['first', 'last'])
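On the sample data the aggregation gives one row per group. The snippet below repeats the setup so it is self-contained:

```python
import pandas as pd

df = pd.DataFrame({
    "Results": ["Buy", "Sell", "Buy", "Neutral", "Buy", "Sell"],
    "Price": [10, 11, 12, 13, 14, 15],
})
a = df.Results.shift().eq("Sell").cumsum()

# first and last Price within each group
agr = df.groupby(a).Price.agg(["first", "last"])
print(agr["first"].tolist())  # [10, 12]
print(agr["last"].tolist())   # [11, 15]
```

Here "first" is the price on the first row of the group, whatever its label, so this relies on each group starting with a Buy.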

Finally, we can use loc to assign a new column:

df.loc[df.Results.eq('Sell'), 'Diff'] = agr['last'].sub(agr['first']).values
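The assignment above needs exactly one aggregated row per Sell row, so it would appear to fail with a length mismatch if the data ends with rows after the last Sell. The following is a sketch of a variant, not the answerer's code, that avoids this by broadcasting the first Buy price of each group back onto the rows; `buy_sell_diff` is a hypothetical name.

```python
import pandas as pd

def buy_sell_diff(df):
    """Price difference between each Sell and the first Buy of its group."""
    is_buy = df["Results"].eq("Buy")
    is_sell = df["Results"].eq("Sell")
    # a new group starts on the row after each Sell
    group = is_sell.shift(fill_value=False).cumsum()
    # price of the first Buy in each group, repeated on every row of the group
    # (groups with no Buy get NaN, because 'first' skips missing values)
    first_buy = df["Price"].where(is_buy).groupby(group).transform("first")
    # keep the difference only on Sell rows
    return (df["Price"] - first_buy).where(is_sell)

df = pd.DataFrame({
    "Results": ["Buy", "Sell", "Buy", "Neutral", "Buy", "Sell"],
    "Price": [10, 11, 12, 13, 14, 15],
})
out = buy_sell_diff(df)
print(out.tolist())  # [nan, 1.0, nan, nan, nan, 3.0]
```

Because only Buy prices are considered as the start of a group, a Sell with no preceding Buy in its group is left as NaN, which matches what the loop in the question does.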

Performance

I can't actually run your code; I get a TypeError, so I can't compare.

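Nice! That solves the readability crisis. But I find this much slower than numba :( E.g. on pd.concat([df]*10**4, ignore_index=True), 8.33 ms vs 6.44 s.

@jpp I think in pandas, vectorised does not always mean efficient :-)

A comment on the input: Buy / Sell can occur anywhere. We are only interested in Buy-Sell-Buy-Sell-Buy-Sell etc. sequences that don't skip any viable combination. This means many Buy / Sell values that don't fit this sequence can be ignored.

I'm confused by the TypeError you are getting and can't see where it would occur. Can you give an example?

The TypeError is on the first line of the function, along the lines of "cannot recognise global signature for type np.object". When I tested, both inputs were float or int. Maybe providing your timing setup would help, and I'll try to verify again.

I think your solution should be fine in terms of results. For performance, I've updated my question with my benchmarking code.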
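The thread does not settle where the TypeError came from. One plausible cause, and this is an assumption, is an object-dtype array (for example the raw label strings) reaching the jitted function, since numba's nopython mode expects numeric arrays. A minimal sketch of checking and fixing the dtypes before the call:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Results": ["Buy", "Sell", "Buy", "Neutral", "Buy", "Sell"],
    "Price": [10, 11, 12, 13, 14, 15],
})

# Map labels to numeric codes; unmapped labels such as Neutral become NaN
codes = df["Results"].map({"Buy": 0, "Sell": 1}).to_numpy(dtype=np.float64)
prices = df["Price"].to_numpy(dtype=np.float64)

# Both arrays should be plain numeric before they are handed to get_diffs
print(codes.dtype, prices.dtype)  # float64 float64
```

Passing `codes` and `prices` built this way keeps both arguments numeric regardless of how the frame was constructed.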
Result of the cumsum / shift / agg approach on the sample data:

   Index  Results  Price  Diff
0      0      Buy     10   NaN
1      1     Sell     11   1.0
2      2      Buy     12   NaN
3      3  Neutral     13   NaN
4      4      Buy     14   NaN
5      5     Sell     15   3.0

In [27]: df = pd.concat([df]*10**4, ignore_index=True)

In [28]: %%timeit
    ...: a = df.Results.shift().eq('Sell').cumsum()
    ...: agr = df.groupby(a).Price.agg(['first', 'last'])
    ...: df.loc[df.Results.eq('Sell'), 'Diff'] = agr['last'].sub(agr['first']).values
    ...:
17.6 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [29]: %%timeit
    ...: s=df.groupby('Results').cumcount()
    ...: df['Diff']=df.Price.groupby(s).diff().loc[df.Results.isin(['Buy','Sell'])]
    ...:
3.71 s ± 331 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)