Python: Find the start and end index of consecutive values in a DataFrame


I have the following DataFrame:

     A    B    C
0    1    1    1
1    0    1    0
2    1    1    1
3    1    0    1
4    1    1    0
5    1    1    0 
6    0    1    1
7    0    1    0
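
For anyone who wants to reproduce the problem, the table above can be rebuilt as follows. This is only a convenience sketch; the variable name df matches the one used later in the thread, and a default integer index is assumed.

```python
import pandas as pd

# Sample data copied from the table above (default RangeIndex 0..7 assumed).
df = pd.DataFrame({
    "A": [1, 0, 1, 1, 1, 1, 0, 0],
    "B": [1, 1, 1, 0, 1, 1, 1, 1],
    "C": [1, 0, 1, 1, 0, 0, 1, 0],
})
print(df)
```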

For each column, I want to know the start and end index of every run of 3 or more consecutive values equal to 1. Expected result:

Column    From    To
     A       2     5
     B       0     2
     B       4     7

First, I zero out the values that are not part of a run of 3 or more consecutive values:

filtered_df = df.copy().apply(filter, threshold=3)

After this step, filtered_df looks like:

     A    B    C
0    0    1    0
1    0    1    0
2    1    1    0
3    1    0    0
4    1    1    0
5    1    1    0 
6    0    1    0
7    0    1    0
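
The filter function itself is not shown in the question. The following is a minimal sketch of what such a per-column function might look like, under the assumption that it should set to 0 every 1 that is not part of a run of at least threshold ones. The name keep_long_runs is hypothetical and is not the asker's code.

```python
import pandas as pd


def keep_long_runs(col: pd.Series, threshold: int = 3) -> pd.Series:
    """Hypothetical stand-in for the asker's `filter`: keep 1s only in long runs."""
    is_one = col.eq(1)
    # A new run starts wherever the value differs from the previous row.
    run_id = col.ne(col.shift()).cumsum()
    # Number of rows in the run that each row belongs to.
    run_len = run_id.groupby(run_id).transform("count")
    return (is_one & run_len.ge(threshold)).astype(int)


df = pd.DataFrame({
    "A": [1, 0, 1, 1, 1, 1, 0, 0],
    "B": [1, 1, 1, 0, 1, 1, 1, 1],
    "C": [1, 0, 1, 1, 0, 0, 1, 0],
})

# Extra keyword arguments to DataFrame.apply are forwarded to the function.
filtered_df = df.apply(keep_long_runs, threshold=3)
print(filtered_df)
```

On the sample data this reproduces the filtered table shown above.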

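If the DataFrame had only a single column of zeros and ones, the result could be obtained with a known single-column approach. However, I am struggling to do something similar for several columns at once.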

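You can use rolling to create a window over the DataFrame. Then you can apply all your conditions and shift the window back to its start position: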

length = 3
window = df.rolling(length)
mask = (window.min() == 1) & (window.max() == 1)
mask = mask.shift(1 - length)
print(mask)

which prints:

       A      B      C
0  False   True  False
1  False  False  False
2   True  False  False
3   True  False  False
4  False   True  False
5  False   True  False
6    NaN    NaN    NaN
7    NaN    NaN    NaN
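
The mask above marks where a full window of ones starts, not the From/To pairs the question asks for. One possible way to get there, sketched under a few assumptions: the shift uses fill_value=False so the mask stays boolean (the answer's version leaves NaN instead), a default integer index is assumed, and the names block_id and runs are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 0, 1, 1, 1, 1, 0, 0],
    "B": [1, 1, 1, 0, 1, 1, 1, 1],
    "C": [1, 0, 1, 1, 0, 0, 1, 0],
})

length = 3
window = df.rolling(length)
mask = (window.min() == 1) & (window.max() == 1)
# Assumption: fill the shifted-in rows with False rather than NaN.
mask = mask.shift(1 - length, fill_value=False).astype(bool)

rows = []
for col in mask.columns:
    starts = mask[col].astype(int)
    # Label blocks of consecutive window starts, keeping only the True rows.
    block_id = starts.ne(starts.shift()).cumsum()[starts.eq(1)]
    for _, block in block_id.groupby(block_id):
        # A block of window starts s..e corresponds to the run s..e+length-1.
        rows.append({
            "Column": col,
            "From": block.index[0],
            "To": block.index[-1] + length - 1,
        })

runs = pd.DataFrame(rows, columns=["Column", "From", "To"])
print(runs)
```

For the sample data this yields the runs A 2..5, B 0..2 and B 4..7.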

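Use DataFrame.pipe to apply a function to the whole DataFrame.

In the first solution, get the first and last index of each run of consecutive 1s per column, append the output to a list, and finally concat: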
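import pandas as pd

def f(df, threshold=3):
    out = []
    for col in df.columns:
        # rows where the value is 1
        m = df[col].eq(1)
        # label each run of equal values, then keep the labels of the 1-rows
        g = (df[col] != df[col].shift()).cumsum()[m]
        # keep only runs with at least `threshold` rows
        mask = g.groupby(g).transform('count').ge(threshold)
        filt = g[mask].reset_index()
        # first and last index of every remaining run
        output = filt.groupby(col)['index'].agg(['first','last'])
        output.insert(0, 'col', col)
        out.append(output)

    return pd.concat(out, ignore_index=True)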
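
Both solutions rely on the same run-labelling step, (s != s.shift()).cumsum(). The small sketch below shows what that step produces for column A of the question; the helper names run_id and run_len are illustrative and do not appear in the answer.

```python
import pandas as pd

# Column A from the question.
s = pd.Series([1, 0, 1, 1, 1, 1, 0, 0], name="A")

# True wherever the value differs from the previous row, so the
# cumulative sum gives every run of equal values its own label.
run_id = (s != s.shift()).cumsum()

# Number of rows in the run each row belongs to.
run_len = run_id.groupby(run_id).transform("count")

print(pd.DataFrame({"value": s, "run_id": run_id, "run_len": run_len}))
```

For column A the labels are 1, 2, 3, 3, 3, 3, 4, 4, so the four ones at indices 2 to 5 share label 3 with a run length of 4, which passes a threshold of 3.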

Alternatively, reshape first with unstack and then apply the solution:
def f(df, threshold=3):

    df1 = df.unstack().rename_axis(('col','idx')).reset_index(name='val')
    m = df1['val'].eq(1)
    g = (df1['val'] != df1.groupby('col')['val'].shift()).cumsum()
    mask = g.groupby(g).transform('count').ge(threshold) & m
    return (df1[mask].groupby([df1['col'], g])['idx']
                    .agg(['first','last'])
                    .reset_index(level=1, drop=True)
                    .reset_index())


filtered_df = df.pipe(f, threshold=3)
print (filtered_df)
  col  first  last
0   A      2     5
1   B      0     2
2   B      4     7

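Maybe wrap the code in a function and then apply that function to the DataFrame as a whole? You would of course need to extend the filter function so that it is applied to every column in df.columns.

Thanks! Both approaches work. Is one of them better than the other?

@Peter - Hard question. With many groups and many columns, the second one should be slower. The best test is to time both on your real data.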
With threshold=2, the run of two 1s in column C is reported as well:

filtered_df = df.pipe(f, threshold=2)
print (filtered_df)
  col  first  last
0   A      2     5
1   B      0     2
2   B      4     7
3   C      2     3
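
The comments suggest timing both approaches on real data. A rough sketch of how such a comparison could be set up is given below. The random test frame, its size, and the names f_loop, f_unstack and big are assumptions made for illustration; actual timings depend on the data and the pandas version, so no numbers are claimed here.

```python
import timeit

import numpy as np
import pandas as pd


def f_loop(df, threshold=3):
    """First solution from the answer: loop over the columns."""
    out = []
    for col in df.columns:
        m = df[col].eq(1)
        g = (df[col] != df[col].shift()).cumsum()[m]
        mask = g.groupby(g).transform('count').ge(threshold)
        filt = g[mask].reset_index()
        output = filt.groupby(col)['index'].agg(['first', 'last'])
        output.insert(0, 'col', col)
        out.append(output)
    return pd.concat(out, ignore_index=True)


def f_unstack(df, threshold=3):
    """Second solution from the answer: reshape with unstack first."""
    df1 = df.unstack().rename_axis(('col', 'idx')).reset_index(name='val')
    m = df1['val'].eq(1)
    g = (df1['val'] != df1.groupby('col')['val'].shift()).cumsum()
    mask = g.groupby(g).transform('count').ge(threshold) & m
    return (df1[mask].groupby([df1['col'], g])['idx']
                     .agg(['first', 'last'])
                     .reset_index(level=1, drop=True)
                     .reset_index())


# Hypothetical test data: 2000 rows, 20 columns of random 0/1 values.
# Zero-padded names keep the column order identical in both results.
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.integers(0, 2, size=(2000, 20)),
                   columns=[f"c{i:02d}" for i in range(20)])

# Sanity check: both functions should report the same runs.
same = f_loop(big).values.tolist() == f_unstack(big).values.tolist()
print("identical results:", same)

for func in (f_loop, f_unstack):
    seconds = timeit.timeit(lambda: func(big), number=5)
    print(f"{func.__name__}: {seconds / 5:.4f} s per call")
```

Running the same loop on a frame with the shape of the real data (many short columns versus a few long ones) gives a more meaningful answer than any general rule.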