How to find the maximum number of consecutive values across multiple columns in Python?

python pandas pandas-groupby cumsum

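I need to determine the maximum number of consecutive values that satisfy a given condition, for several columns at once.

If my df is: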

A    B    C    D    E
26   24   21   23   24
26   23   22   15   23 
24   19   17   11   15     
27   22   28   24   24 
26   27   30   23   11 
26   26   29   27   29

For each column, I want to know the longest run of consecutive values greater than 25. So the output would be:

A 3
B 2
C 3
D 1
E 1

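With the code I have, I can get the result for one column at a time. Is there a way to produce a table like the one above without repeating it for every column? (I have more than 40 columns in total.)

Thanks in advance.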

One option that uses numpy to compute the maximum consecutive run length:

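import numpy as np

def max_consecutive(arr):
    # arr is a boolean Series; cast to int so the differences are well defined
    # (NumPy does not support subtracting boolean arrays)
    values = arr.to_numpy().astype(int)

    # calculate the indices where the condition changes (both ends included)
    split_indices = np.flatnonzero(np.ediff1d(values, to_begin=1, to_end=1))

    # calculate the chunk length of consecutive values and pick every other value based on
    # the initial value (True runs start at offset 0 if the first value is True, else 1)
    try:
        max_size = np.diff(split_indices)[int(not arr.iat[0])::2].max()
    except ValueError:
        # the column contains no True run at all
        max_size = 0
    return max_size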

df.gt(25).apply(max_consecutive)
#A    3
#B    2
#C    3
#D    1
#E    1
#dtype: int64
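
To see why this works, it can help to trace the intermediate arrays for column A of the sample frame. The following is only an illustrative sketch; the variable names (`flags`, `changes`, `runs`, `true_runs`) are made up for this walk-through and are not part of the answer.

```python
import numpy as np

# Column A of the sample frame, tested against the > 25 condition.
flags = np.array([26, 26, 24, 27, 26, 26]) > 25    # [T, T, F, T, T, T]

# Cast to int so the differences are well defined, and force a "change"
# at both ends so the first and last runs get boundaries too.
changes = np.ediff1d(flags.astype(int), to_begin=1, to_end=1)
split_indices = np.flatnonzero(changes)             # [0, 2, 3, 6]

# Lengths of all runs; True runs and False runs alternate.
runs = np.diff(split_indices)                       # [2, 1, 3]

# The True runs are every other entry; where they start depends on
# whether the column begins with True or False.
start = 0 if flags[0] else 1
true_runs = runs[start::2]                          # [2, 3]
longest = int(true_runs.max())                      # 3
print(longest)
```

If the column held no value above 25, `true_runs` would be empty and `.max()` would raise `ValueError`, which is the case the `try`/`except` in `max_consecutive` covers.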

Timing compared with the other approach:

%timeit df.gt(25).apply(max_consecutive)
# 520 µs ± 6.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit (df>25).apply(lambda x :x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).mask(df<25).max(0)
# 10.3 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
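
For readers unfamiliar with the `groupby`/`cumsum` idiom in the second timing, here is a minimal sketch of the same idea written out in steps. The helper name `max_run_groupby` is hypothetical, and it detects changes with `ne(shift())` rather than `diff()`; that is a substitution for illustration, not the expression that was timed.

```python
import pandas as pd

# The sample frame from the question.
df = pd.DataFrame({
    "A": [26, 26, 24, 27, 26, 26],
    "B": [24, 23, 19, 22, 27, 26],
    "C": [21, 22, 17, 28, 30, 29],
    "D": [23, 15, 11, 24, 23, 27],
    "E": [24, 23, 15, 24, 11, 29],
})

def max_run_groupby(col):
    """Longest run of True in a boolean Series (hypothetical helper)."""
    # A new run starts wherever the value differs from the previous row;
    # the cumulative sum of those change points gives each run an id.
    run_id = col.ne(col.shift()).cumsum()
    # Summing the flags inside each run gives its length for True runs
    # and 0 for False runs, so the overall max is the longest True run.
    return int(col.astype(int).groupby(run_id).sum().max())

result = (df > 25).apply(max_run_groupby)
print(result)
```

On the sample frame this gives 3, 2, 3, 1, 1 for columns A to E, matching the expected output in the question.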

Is this what you want? A pandas approach (PS: I didn't expect to manage it in one line, LOL):

(df>25).apply(lambda x :x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).mask(df<25).max(0)

Here's one with NumPy -
import numpy as np
import pandas as pd

# mask is a 2D boolean array representing islands as True values per col
def max_island_len_cols(mask):
    m,n = mask.shape
    out = np.zeros(n,dtype=int)
    b = np.zeros((m+2,n),dtype=bool)
    b[1:-1] = mask
    for i in range(mask.shape[1]):
        idx = np.flatnonzero(b[1:,i] != b[:-1,i])
        if len(idx)>0:
            out[i] = (idx[1::2] - idx[::2]).max()
    return out

output = pd.Series(max_island_len_cols(df.values>25), index=df.columns)

Sample run -
In [690]: df
Out[690]: 
    A   B   C   D   E
0  26  24  21  23  24
1  26  23  22  15  23
2  24  19  17  11  15
3  27  22  28  24  24
4  26  27  30  23  11
5  26  26  29  27  29

In [690]: 

In [691]: pd.Series(max_island_len_cols(df.values>25), index=df.columns)
Out[691]: 
A    3
B    2
C    3
D    1
E    1
dtype: int64
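
The padding step is what makes the start/end pairing work, so here is a small sketch that traces it for a single column (column A of the sample). The names `padded`, `starts`, `ends` and `lengths` are illustrative only and do not appear in the function above.

```python
import numpy as np

col = np.array([26, 26, 24, 27, 26, 26]) > 25       # column A: [T, T, F, T, T, T]

# Pad with False on both sides so every island has both a start and an end,
# even when it touches the first or last row.
padded = np.zeros(len(col) + 2, dtype=bool)
padded[1:-1] = col

# Positions where neighbouring entries differ; starts and ends alternate.
idx = np.flatnonzero(padded[1:] != padded[:-1])     # [0, 2, 3, 6]
starts, ends = idx[::2], idx[1::2]                  # [0, 3] and [2, 6]
lengths = ends - starts                             # [2, 3]

# An all-False column yields no islands, hence the length check.
longest = int(lengths.max()) if len(lengths) else 0 # 3
print(longest)
```

Because of the padding, `idx` always has an even number of entries, so `idx[::2]` and `idx[1::2]` line up one-to-one.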


Runtime test
Inspired by the given sample, which has numbers in the range (24,28) and 40 cols, let's set up a much larger input dataframe and test all the solutions -
# Input dataframe
In [692]: df = pd.DataFrame(np.random.randint(24,28,(1000,40)))

# Proposed in this post
In [693]: %timeit pd.Series(max_island_len_cols(df.values>25), index=df.columns)
1000 loops, best of 3: 539 µs per loop

# @Psidom's solution
In [694]: %timeit df.gt(25).apply(max_consecutive)
1000 loops, best of 3: 1.81 ms per loop

# @Wen's solution
In [695]: %timeit (df>25).apply(lambda x :x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).mask(df<25).max(0)
10 loops, best of 3: 95.2 ms per loop

An approach using pandas and scipy.ndimage.label, just for fun:
import pandas as pd
from scipy.ndimage import label

struct = [[0, 1, 0],     # Structure used for segmentation
          [0, 1, 0],     # Equivalent to axis=0 in `numpy`
          [0, 1, 0]]     # Or 'columns' in `pandas`

labels, nlabels = label(df > 25, structure=struct)

>>> labels               # Labels for each column-wise block of consecutive numbers > 25
Out[]:
array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [2, 0, 3, 0, 0],
       [2, 4, 3, 0, 0],
       [2, 4, 3, 5, 6]])

labels_df = pd.DataFrame(columns=df.columns, data=labels)  # Add original columns names

res = (labels_df.apply(lambda x: x.value_counts())  # Execute `value_counts` on each column
                .iloc[1:]                           # slice results for labels > 0
                .max())                             # and get max value

>>> res
Out[]:
A    3.0
B    2.0
C    3.0
D    1.0
E    1.0
dtype: float64
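
The result comes back as floats because `value_counts` is aligned across columns, and a column that lacks a given label gets NaN for it. If integer output is preferred, one possible alternative (a sketch, not part of the answer above) is to count the block sizes with `np.bincount` and map them back onto the label array. The names `sizes` and `res_int` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from scipy.ndimage import label

# The sample frame from the question.
df = pd.DataFrame({
    "A": [26, 26, 24, 27, 26, 26],
    "B": [24, 23, 19, 22, 27, 26],
    "C": [21, 22, 17, 28, 30, 29],
    "D": [23, 15, 11, 24, 23, 27],
    "E": [24, 23, 15, 24, 11, 29],
})

# Same vertical-only connectivity as in the answer.
struct = [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]]
labels, nlabels = label((df > 25).to_numpy(), structure=struct)

# Size of every labelled block; label 0 is the background, so zero it out.
sizes = np.bincount(labels.ravel(), minlength=nlabels + 1)
sizes[0] = 0

# Replace each label by the size of its block, then take the column-wise max.
res_int = pd.Series(sizes[labels].max(axis=0), index=df.columns)
print(res_int)
```

This also avoids relying on `.iloc[1:]` to drop the background row.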

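Could you explain how you got 3?
This is a nice question :)
Could you guys stop bullying my poor method (just kidding). Upvoted.
@Wen That's what you get with one-liners ;)
Out of interest, how should I change the code if the reference point is another column rather than a fixed >25? For example, whether the values in columns B, C, D and E are greater than the value in the same row of column A?
@MarandaRidgway df.subtract(df.A, axis=0).gt(0)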
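
Following up on the last comment: once the boolean mask is built with `df.subtract(df.A, axis=0).gt(0)`, any of the run-length approaches above can be applied to it unchanged. Below is a sketch that pairs it with a loop-free NumPy run counter. The function name `max_true_run_per_col` and the cumulative-sum reset trick are additions for illustration, not something taken from the answers.

```python
import numpy as np
import pandas as pd

# The sample frame from the question.
df = pd.DataFrame({
    "A": [26, 26, 24, 27, 26, 26],
    "B": [24, 23, 19, 22, 27, 26],
    "C": [21, 22, 17, 28, 30, 29],
    "D": [23, 15, 11, 24, 23, 27],
    "E": [24, 23, 15, 24, 11, 29],
})

def max_true_run_per_col(mask):
    """Longest run of True per column of a 2D boolean array (hypothetical helper)."""
    m = np.asarray(mask).astype(int)
    if m.shape[0] == 0:
        return np.zeros(m.shape[1], dtype=int)
    total = m.cumsum(axis=0)
    # Value of the running total at the most recent False in each column.
    at_last_false = np.maximum.accumulate(np.where(m == 0, total, 0), axis=0)
    # Run length at every position = running total minus that baseline.
    return (total - at_last_false).max(axis=0)

# True where a value is strictly greater than column A in the same row.
mask = df.subtract(df["A"], axis=0).gt(0)
result = pd.Series(max_true_run_per_col(mask), index=df.columns)
print(result)
```

Column A compared with itself is never greater, so its entry is 0; the same helper applied to `df > 25` reproduces the output from the question.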