Python pandas - Count streaks of values higher/lower than the current row


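I'm looking for a way to take a pandas Series and return new Series giving, for each row, the number of immediately preceding consecutive values that the row is higher than (and, separately, lower than):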

a = pd.Series([30, 10, 20, 25, 35, 15])
…should output:

Value   Higher than streak  Lower than streak
30      0                   0
10      0                   1
20      1                   0
25      2                   0
35      4                   0
15      0                   3
This would allow someone to determine how significant each "regional max/min" value is within the time series.

Thanks in advance.

import pandas as pd
import numpy as np

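value = pd.Series([30, 10, 20, 25, 35, 15])

# For each row x, count how many of the earlier rows (value[:x]) it is lower / higher than.
# This counts all earlier rows, not only an unbroken run.
Lower = [(value[x] < value[:x]).sum() for x in range(len(value))]
Higher = [(value[x] > value[:x]).sum() for x in range(len(value))]

df = pd.DataFrame({"value": value, "Higher": Higher, "Lower": Lower})

print(df)
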
      Lower  Higher  value
0       0      0     30
1       1      0     10
2       1      1     20
3       1      2     25
4       0      4     35
5       4      1     15
Edit: updated to really count consecutive values. I couldn't think of a workable solution, so we are back to loops.

df = pd.Series(np.random.rand(10000))

def count_bigger_consecutives(values):
  length = len(values)
  result = np.zeros(length)
  for i in range(length):
    for j in range(i):
      if(values[i]>values[j]):
        result[i] += 1
      else:
        break
  return result

%timeit count_bigger_consecutives(df.values)
1 loop, best of 3: 365 ms per loop
If performance is a concern, you can achieve a speedup with numba, a just-in-time compiler for Python code. In this example you can really see numba shine:

from numba import jit 
@jit(nopython=True)
def numba_count_bigger_consecutives(values):
  length = len(values)
  result = np.zeros(length)
  for i in range(length):
    for j in range(i):
      if(values[i]>values[j]):
        result[i] += 1
      else:
        break
  return result

%timeit numba_count_bigger_consecutives(df.values)
The slowest run took 543.09 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 161 µs per loop
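
Note that the inner loop in both versions above runs j from 0 upward, so it compares the current value with the oldest values first; on the question's sample this gives 0, 0, 0, 0, 4, 0. A minimal sketch of a variant that walks backward from the immediately preceding element, which reproduces the question's "Higher than streak" column. The name count_higher_streak is illustrative, and the same loop body should also be usable under @jit(nopython=True):

```python
import numpy as np

def count_higher_streak(values):
    """For each position, count the unbroken run of immediately preceding
    values that the current value is strictly greater than."""
    length = len(values)
    result = np.zeros(length, dtype=np.int64)
    for i in range(length):
        # walk backward from the nearest previous element
        for j in range(i - 1, -1, -1):
            if values[i] > values[j]:
                result[i] += 1
            else:
                break  # the run is broken, stop looking further back
    return result

sample = np.array([30, 10, 20, 25, 35, 15])
print(count_higher_streak(sample))  # [0 0 1 2 4 0]
```

Flipping the comparison to values[i] < values[j] gives the "Lower than streak" column in the same way.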

Here is a solution proposed by a colleague (probably not the most efficient, but it does the job):

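Input data; create the "Higher" column; merge the two new series.

Here is my solution. It has a loop, but the number of iterations is only the maximum streak length. It keeps track of whether the streak for each row has already been fully counted, and stops once every row is done. It uses shift to test whether an earlier row is higher/lower, and keeps increasing the shift until all streaks have been found.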

a = pd.Series([30, 10, 20, 25, 35, 15, 15])

a_not_done_greater = pd.Series(np.ones(len(a))).astype(bool)
a_not_done_less = pd.Series(np.ones(len(a))).astype(bool)

a_streak_greater = pd.Series(np.zeros(len(a))).astype(int)
a_streak_less = pd.Series(np.zeros(len(a))).astype(int)

s = 1
not_done_greater = True
not_done_less = True

while not_done_greater or not_done_less:
    if not_done_greater:
        a_greater_than_shift = (a > a.shift(s))
        a_streak_greater = a_streak_greater + (a_not_done_greater.astype(int) * a_greater_than_shift)
        a_not_done_greater = a_not_done_greater & a_greater_than_shift
        not_done_greater = a_not_done_greater.any()

    if not_done_less:
        a_less_than_shift = (a < a.shift(s))
        a_streak_less = a_streak_less + (a_not_done_less.astype(int) * a_less_than_shift)
        a_not_done_less = a_not_done_less & a_less_than_shift
        not_done_less = a_not_done_less.any()

    s = s + 1


res = pd.concat([a, a_streak_greater, a_streak_less], axis=1)
res.columns = ['value', 'greater_than_streak', 'less_than_streak']
print(res)
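
The same shift-based idea can be written once and reused for both directions. This is a minimal sketch, not the answerer's code: streak_by_shift and its compare parameter are hypothetical names, and it assumes that comparisons against the NaN values introduced by shift evaluate to False, which is what ends every streak at the start of the series.

```python
import operator
import pandas as pd

def streak_by_shift(a, compare):
    """Count, per row, the unbroken run of preceding rows for which
    compare(current, earlier) holds, using one shift per loop iteration."""
    streak = pd.Series(0, index=a.index)
    still_running = pd.Series(True, index=a.index)
    s = 1
    while still_running.any():
        # a row stays "running" only while every shift so far has passed
        still_running = still_running & compare(a, a.shift(s))
        streak = streak + still_running.astype(int)
        s += 1
    return streak

a = pd.Series([30, 10, 20, 25, 35, 15, 15])
res = pd.DataFrame({
    "value": a,
    "greater_than_streak": streak_by_shift(a, operator.gt),
    "less_than_streak": streak_by_shift(a, operator.lt),
})
print(res)
```

The loop runs at most one iteration more than the longest streak, so it stays cheap when streaks are short relative to the series length.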

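Since you are looking back at previous values to see whether there are consecutive values, you have to interact with the index somehow. This solution first looks at all values before the current index to see whether they are less than or greater than the current value, then sets any values to False if a False follows them. It also avoids creating an iterator over the DataFrame, which may speed up the operation on large data sets.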

import pandas as pd
from operator import gt, lt

a = pd.Series([30, 10, 20, 25, 35, 15])

def consecutive_run(op, ser, i):
    """
    Sum the uninterrupted consecutive runs at index i in the series where the previous data
    was true according to the operator.
    """
    thresh_all = op(ser[:i], ser[i])
    # find any data where the operator was not passing.  set the previous data to all falses
    non_passing = thresh_all[~thresh_all]
    start_idx = 0
    if not non_passing.empty:
        # if there was a failure, there was a break in the consecutive truth values,
        # so get the final False position. Starting index will be False, but it
        # will either be at the end of the series selection and will sum to zero
        # or will be followed by all successive True values afterwards
        start_idx = non_passing.index[-1]
    # count the consecutive runs by summing from the start index onwards
    return thresh_all[start_idx:].sum()


res = pd.concat([a, a.index.to_series().map(lambda i: consecutive_run(gt, a, i)),
                 a.index.to_series().map(lambda i: consecutive_run(lt, a, i))],
       axis=1)
res.columns = ['Value', 'Higher than streak', 'Lower than streak']
print(res)
Result (the output table appears at the end of the page):


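Thanks for your answer. Unfortunately, this solution does not produce the result I expected, because each row should only be evaluated against the rows that precede it. E.g. in the second observation, 10 is lower than 30, so the Lower column = 1 and the Higher column = 0.

Maybe you have to change the names of the Higher and Lower columns according to the logic you assumed.

Presumably this is a more "Pythonic" solution.

Thanks. Very interesting, I wasn't familiar with expanding(). However, this is not quite the expected behavior. I need to know the maximum number of consecutive past observations in my time series for which the current row would still be the max() or min().

Wow. That is much faster. Thanks for sharing this solution. Unfortunately the result comes out as array([0,0,0,0,0,4,0.]), whereas I expected 0,0,1,2,4,0. Since a solution still seems to need a loop, your suggestion to use numba is still very useful.

Thanks, I don't think we will find a solution that avoids loops.

Updated to use a more efficient summing algorithm: just take the values adjacent to the consecutive run, then sum them.

Remaining code from the colleague's solution (building the "Lower" column and merging):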
c = []

for idx, value in enumerate(a):
    count = 0
    for i in range(idx, 0, -1):
        if value > a.loc[i-1]:
            break
        count += 1
    c.append([value, count])

lower = pd.DataFrame(c, columns=['Value', 'Lower'])
print(pd.merge(higher, lower, on='Value'))

   Value  Higher  Lower
0     30       0      0
1     10       0      1
2     20       1      0
3     25       2      0
4     35       4      0
5     15       0      3
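
The snippet above uses a and higher, neither of which is defined in the code that survives on this page. A minimal sketch of the whole computation, assuming the "Higher" column was built with the mirror image of the "Lower" loop; streak_column is a hypothetical helper and this is a reconstruction, not the colleague's original code:

```python
import pandas as pd

a = pd.Series([30, 10, 20, 25, 35, 15])  # input data from the question

def streak_column(series, breaks_streak):
    """Walk backward from each row and count previous rows until
    breaks_streak(current, previous) is True."""
    counts = []
    for idx, value in enumerate(series):
        count = 0
        for i in range(idx, 0, -1):
            if breaks_streak(value, series.iloc[i - 1]):
                break
            count += 1
        counts.append(count)
    return counts

result = pd.DataFrame({
    "Value": a,
    # assumed mirror image: stop at the first previous value that is larger
    "Higher": streak_column(a, lambda cur, prev: cur < prev),
    # same test as the surviving loop: stop at the first previous value that is smaller
    "Lower": streak_column(a, lambda cur, prev: cur > prev),
})
print(result)
```

Building one DataFrame directly avoids merging on 'Value', which would produce extra rows if the series contained repeated values. Note that with these tests an equal previous value does not break either streak.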

Output of the shift-based solution above for a = pd.Series([30, 10, 20, 25, 35, 15, 15]):
   value  greater_than_streak  less_than_streak
0     30                    0                 0
1     10                    0                 1
2     20                    1                 0
3     25                    2                 0
4     35                    4                 0
5     15                    0                 3
6     15                    0                 0

Output of the consecutive_run solution above:
   Value  Higher than streak  Lower than streak
0     30                   0                  0
1     10                   1                  0
2     20                   0                  1
3     25                   0                  2
4     35                   0                  4
5     15                   3                  0
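
In the table above, the two streak columns come out swapped relative to the question's expected output. The call op(ser[:i], ser[i]) puts the earlier values on the left, so gt counts earlier values that are greater than the current one, which is the "lower than" streak. A minimal sketch, assuming the question's meaning of the columns; trailing_run is a hypothetical helper rather than the answerer's code, and it uses positional .iloc so it does not depend on the index labels:

```python
import pandas as pd
from operator import gt, lt

a = pd.Series([30, 10, 20, 25, 35, 15])

def trailing_run(op, ser, pos):
    """Length of the unbroken run of rows directly before pos for which
    op(previous, current) is True."""
    passed = op(ser.iloc[:pos], ser.iloc[pos]).to_numpy()
    run = 0
    for ok in passed[::-1]:  # nearest previous row first
        if not ok:
            break
        run += 1
    return run

positions = range(len(a))
res = pd.DataFrame({
    "Value": a,
    # previous < current means the current row is higher
    "Higher than streak": [trailing_run(lt, a, p) for p in positions],
    # previous > current means the current row is lower
    "Lower than streak": [trailing_run(gt, a, p) for p in positions],
})
print(res)
```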