Python: how to find the maximum consecutive count for multiple columns?
Tags: python, pandas, pandas-groupby, cumsum
I need to determine the maximum number of consecutive values that meet a specific condition, for multiple columns. If my df is:
A B C D E
26 24 21 23 24
26 23 22 15 23
24 19 17 11 15
27 22 28 24 24
26 27 30 23 11
26 26 29 27 29
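I would like to know, for each column, the maximum number of consecutive times a number greater than 25 appears. So the output would be: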
A 3
B 2
C 3
D 1
E 1
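With my current code I can only get the result for one column at a time; is there a way to create a table like the one above instead of repeating it for every column (I have more than 40 columns in total)?
Thanks in advance.
One option is to use numpy to calculate the maximum consecutive run: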
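import numpy as np

def max_consecutive(arr):
    # arr is a boolean Series; find the indices where its value changes
    # (the int8 cast avoids subtracting booleans, which NumPy does not support)
    split_indices = np.flatnonzero(
        np.ediff1d(arr.to_numpy().astype(np.int8), to_begin=1, to_end=1))
    # lengths of the consecutive chunks; True and False chunks alternate, so
    # take every other one, starting at 0 or 1 depending on the first value
    try:
        max_size = np.diff(split_indices)[int(not arr.iat[0])::2].max()
    except ValueError:
        # the column contains no True chunk at all
        max_size = 0
    return max_size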
df.gt(25).apply(max_consecutive)
#A 3
#B 2
#C 3
#D 1
#E 1
#dtype: int64
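To see why the every-other-chunk slicing picks the right runs, here is a small walk-through on column A of the sample. It is only a sketch: the names edges, run_lengths and true_runs are illustrative, and the int8 cast is an assumption made so that ediff1d does not have to subtract booleans.

```python
import numpy as np
import pandas as pd

# Column A of the sample, turned into the boolean condition "> 25"
a = pd.Series([26, 26, 24, 27, 26, 26]).gt(25)   # T, T, F, T, T, T

# Positions where the value changes, with a forced boundary at both ends
edges = np.flatnonzero(
    np.ediff1d(a.to_numpy().astype(np.int8), to_begin=1, to_end=1))

# Distance between neighbouring boundaries = length of each run
run_lengths = np.diff(edges)

# Runs alternate True/False; start at 0 if the first value is True, else at 1
true_runs = run_lengths[int(not a.iat[0])::2]
longest = int(true_runs.max())

print(edges, run_lengths, true_runs, longest)
```

Here the boundaries fall at positions 0, 2, 3 and 6, so the runs have lengths 2, 1 and 3, and the True runs are the first and third.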
Timing, compared with the other approach:
%timeit df.gt(25).apply(max_consecutive)
# 520 µs ± 6.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit (df>25).apply(lambda x :x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).mask(df<25).max(0)
# 10.3 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Is this what you want? A pandas approach (PS: I didn't expect I could do it in one line, LOL):
(df>25).apply(lambda x :x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).mask(df<25).max(0)
Here's one with NumPy -
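import numpy as np
import pandas as pd

# mask is 2D boolean array representing islands as True values per col
def max_island_len_cols(mask):
    m,n = mask.shape
    out = np.zeros(n,dtype=int)
    # pad one False row at the top and bottom so every island has a start and a stop
    b = np.zeros((m+2,n),dtype=bool)
    b[1:-1] = mask
    for i in range(mask.shape[1]):
        # positions where the column switches value: starts at even slots, stops at odd slots
        idx = np.flatnonzero(b[1:,i] != b[:-1,i])
        if len(idx)>0:
            out[i] = (idx[1::2] - idx[::2]).max()
    return out

output = pd.Series(max_island_len_cols(df.values>25), index=df.columns)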
Sample run -
In [690]: df
Out[690]:
A B C D E
0 26 24 21 23 24
1 26 23 22 15 23
2 24 19 17 11 15
3 27 22 28 24 24
4 26 27 30 23 11
5 26 26 29 27 29
In [690]:
In [691]: pd.Series(max_island_len_cols(df.values>25), index=df.columns)
Out[691]:
A 3
B 2
C 3
D 1
E 1
dtype: int64
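A quick way to sanity-check any of the solutions in this thread is a slow but obvious reference built on itertools.groupby. This is a minimal sketch, not one of the posted answers; the helper name max_run_reference is hypothetical.

```python
from itertools import groupby

import pandas as pd

def max_run_reference(flags):
    """Longest run of True values in an iterable of booleans (0 if none)."""
    return max((sum(1 for _ in grp) for key, grp in groupby(flags) if key),
               default=0)

df = pd.DataFrame({
    "A": [26, 26, 24, 27, 26, 26],
    "B": [24, 23, 19, 22, 27, 26],
    "C": [21, 22, 17, 28, 30, 29],
    "D": [23, 15, 11, 24, 23, 27],
    "E": [24, 23, 15, 24, 11, 29],
})

# One scalar per column, so apply returns a Series indexed by column name
expected = (df > 25).apply(max_run_reference)
print(expected)
```

Comparing a faster solution against expected on a few random frames is usually enough to catch off-by-one slips in the slicing.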
Runtime test
Inspired by the given sample, let's set up a much larger input dataframe, with integers drawn from [24, 28) and 40 columns, and test all the solutions -
# Input dataframe
In [692]: df = pd.DataFrame(np.random.randint(24,28,(1000,40)))
# Proposed in this post
In [693]: %timeit pd.Series(max_island_len_cols(df.values>25), index=df.columns)
1000 loops, best of 3: 539 µs per loop
# @Psidom's solution
In [694]: %timeit df.gt(25).apply(max_consecutive)
1000 loops, best of 3: 1.81 ms per loop
# @Wen's solution
In [695]: %timeit (df>25).apply(lambda x :x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).mask(df<25).max(0)
10 loops, best of 3: 95.2 ms per loop
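One thing worth checking before comparing outputs on this benchmark frame: randint(24,28) produces the value 25, and in the groupby one-liner mask(df<25) only blanks cells below 25, so a cell equal to 25 keeps the running length of its False run. Below is a minimal sketch of a variant that masks on the condition itself. The function name max_run_groupby and the threshold parameter are illustrative, not part of the posted answers.

```python
import pandas as pd

def max_run_groupby(df, threshold=25):
    m = df.gt(threshold)

    def running_length(col):
        # A new run id starts wherever the boolean value changes
        run_id = col.astype(int).diff().ne(0).cumsum()
        # Position inside the current run, counted from 1
        return col.groupby(run_id).cumcount() + 1

    counts = m.apply(running_length)
    # Keep the counts only where the condition holds, zero elsewhere
    return counts.where(m, 0).max().astype(int)

check = pd.DataFrame({"X": [25, 25, 25, 25, 26],
                      "Y": [26, 26, 26, 25, 24]})
result = max_run_groupby(check)
print(result)
```

Column X has four cells equal to 25 followed by a single 26, so the longest run above 25 is 1; masking on the condition rather than on df<25 is what keeps the run of 25s from being counted.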
A way using pandas and scipy.ndimage.label, for fun:
import pandas as pd
from scipy.ndimage import label
struct = [[0, 1, 0],  # Structure used for segmentation
          [0, 1, 0],  # Equivalent to axis=0 in `numpy`
          [0, 1, 0]]  # Or 'columns' in `pandas`
labels, nlabels = label(df > 25, structure=struct)
>>> labels # Labels for each column-wise block of consecutive numbers > 25
Out[]:
array([[1, 0, 0, 0, 0],
[1, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[2, 0, 3, 0, 0],
[2, 4, 3, 0, 0],
[2, 4, 3, 5, 6]])
labels_df = pd.DataFrame(columns=df.columns, data=labels)  # Add original columns names
res = (labels_df.apply(lambda x: x.value_counts())  # Execute `value_counts` on each column
                .iloc[1:]  # slice results for labels > 0
                .max())  # and get max value
>>> res
Out[]:
A 3.0
B 2.0
C 3.0
D 1.0
E 1.0
dtype: float64
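The result above comes out as float64 because value_counts leaves NaN wherever a label does not occur in a column. If integer output matters, one possible variant (a sketch, not the answer's own code) counts the label sizes with np.bincount and maps them back onto the grid; the names sizes and res_int are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.ndimage import label

df = pd.DataFrame({
    "A": [26, 26, 24, 27, 26, 26],
    "B": [24, 23, 19, 22, 27, 26],
    "C": [21, 22, 17, 28, 30, 29],
    "D": [23, 15, 11, 24, 23, 27],
    "E": [24, 23, 15, 24, 11, 29],
})

# Same structure as above: only vertical neighbours are connected
struct = [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]]
labels, nlabels = label(df.to_numpy() > 25, structure=struct)

# Number of cells carrying each label; label 0 is the background
sizes = np.bincount(labels.ravel(), minlength=nlabels + 1)
sizes[0] = 0

# Replace every cell by the size of its island, then take the column maximum
res_int = pd.Series(sizes[labels].max(axis=0), index=df.columns)
print(res_int)
```

A column with no value above 25 simply yields 0 here, since all of its cells map to the background size.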
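Comments:
Can you explain how you got 3?
This is a good question :)
Can you guys stop bullying my poor method (just kidding), upvoted.
@Wen That's what you get with one-liners ;)
Out of interest, how should I change the code if the reference point is another column rather than a fixed > 25? For example, is the first row in columns B, C, D and E greater than the same row in column A?
@MarandaRidgway df.subtract(df.A, axis=0).gt(0)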
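To make the last comment concrete, here is a sketch that builds the mask with df.subtract(df.A, axis=0).gt(0) and feeds it to the NumPy function from the answer above, which is repeated so the snippet runs on its own. The variable names above_a and runs_above_a are illustrative.

```python
import numpy as np
import pandas as pd

def max_island_len_cols(mask):
    # Same function as in the NumPy answer, repeated to keep this runnable
    m, n = mask.shape
    out = np.zeros(n, dtype=int)
    b = np.zeros((m + 2, n), dtype=bool)
    b[1:-1] = mask
    for i in range(n):
        idx = np.flatnonzero(b[1:, i] != b[:-1, i])
        if len(idx) > 0:
            out[i] = (idx[1::2] - idx[::2]).max()
    return out

df = pd.DataFrame({
    "A": [26, 26, 24, 27, 26, 26],
    "B": [24, 23, 19, 22, 27, 26],
    "C": [21, 22, 17, 28, 30, 29],
    "D": [23, 15, 11, 24, 23, 27],
    "E": [24, 23, 15, 24, 11, 29],
})

# True where a cell is strictly greater than column A in the same row
above_a = df.subtract(df["A"], axis=0).gt(0)
runs_above_a = pd.Series(max_island_len_cols(above_a.to_numpy()),
                         index=df.columns)
print(runs_above_a)
```

Column A compared with itself is never strictly greater, so its entry is 0; dropping it from the mask first would be an equally reasonable choice.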