Python: how to slice a multi-index DataFrame, keeping all values until a condition is met?


I have a 3-level multi-index DataFrame and I would like to slice it so that all values are kept until a certain condition is met. As an example, I have the following DataFrame:

                           Col1  Col2
Date          Range  Label
'2018-08-01'  1      A     900   815
                     B     850   820
                     C     800   820
                     D     950   840
              2      A     900   820
                     B     750   850
                     C     850   820
                     D     850   800
I want to select all values until Col1 is less than Col2; as soon as a row with Col1 < Col2 appears within a Range group, that row and everything after it in the group should be discarded. The expected output is:
                           Col1  Col2
Date          Range  Label
'2018-08-01'  1      A     900   815
                     B     850   820
              2      A     900   820
I have tried several options but have not found a good solution yet. I can easily keep all the rows where Col1 > Col2 with:

df_new=df[df['Col1']>df['Col2']]
but that is not what I need. I have also been thinking of looping over the level-1 index and slicing the DataFrame with pd.IndexSlice:

idx = pd.IndexSlice
idx_lev1 = df.index.get_level_values(1).unique()

for j in idx_lev1:
    df_lev1 = df.loc[idx[:, j, :], :]
    # position of the first row with Col1 < Col2; note this raises an
    # IndexError when no row in the group satisfies the condition
    idxs = df_lev1.index.get_level_values(2)[np.where(df_lev1['Col1'] < df_lev1['Col2'])[0][0] - 1]
    df_sliced = df_lev1.loc[idx[:, :, :idxs], :]
Update: I tried to implement the proposed solutions; they work fine with the test DataFrame I created for this question, but not with my real data. The problem is that there may be groups in which Col1 is always greater than Col2, and in that case I simply want to keep all of the data. None of the solutions proposed so far really handles this case.

For a more realistic test case, you can use the following example:

s="""                         
Date  Range  Label  Col1  Col2
'2018-08-01'  1  1  900   815
'2018-08-01'  1  2  950   820
'2018-08-01'  1  3  900   820
'2018-08-01'  1  4  950   840
'2018-08-01'  2  1  900   820
'2018-08-01'  2  2  750   850
'2018-08-01'  2  3  850   820
'2018-08-01'  2  4  850   800
'2018-08-02'  1  1  900   815
'2018-08-02'  1  2  850   820
'2018-08-02'  1  3  800   820
'2018-08-02'  1  4  950   840
'2018-08-02'  2  1  900   820
'2018-08-02'  2  2  750   850
'2018-08-02'  2  3  850   820
'2018-08-02'  2  4  850   800
"""
Alternatively, you can download the hdf file. It is a subset of the DataFrame I am actually working with.
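As a side note on the "keep rows until the condition first holds" requirement: a compact way to express it (a sketch, not taken from the answers below) is a grouped `cummax` over the boolean condition, which also keeps whole groups in which Col1 never drops below Col2:

```python
import pandas as pd
from io import StringIO

s = """
Date  Range  Label  Col1  Col2
'2018-08-01'  1  1  900   815
'2018-08-01'  1  2  950   820
'2018-08-01'  1  3  900   820
'2018-08-01'  1  4  950   840
'2018-08-01'  2  1  900   820
'2018-08-01'  2  2  750   850
'2018-08-01'  2  3  850   820
'2018-08-01'  2  4  850   800
'2018-08-02'  1  1  900   815
'2018-08-02'  1  2  850   820
'2018-08-02'  1  3  800   820
'2018-08-02'  1  4  950   840
'2018-08-02'  2  1  900   820
'2018-08-02'  2  2  750   850
'2018-08-02'  2  3  850   820
'2018-08-02'  2  4  850   800
"""

df = pd.read_csv(StringIO(s), sep=r'\s+', index_col=['Date', 'Range', 'Label'])

# cummax() turns "Col1 < Col2 on this row" into "Col1 < Col2 has already
# happened in this (Date, Range) group", so ~bad keeps every row up to
# (but excluding) the first violation, and keeps violation-free groups whole
bad = (df['Col1'] < df['Col2']).groupby(level=['Date', 'Range']).cummax()
result = df[~bad]
```

With the data above this keeps 4 + 1 + 2 + 1 rows across the four (Date, Range) groups.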

The idea is to number each row within its group, then find the first row that meets the condition, and use it to keep only the rows numbered below that value.

Try this:

    from collections import defaultdict
    import pandas as pd
    from io import StringIO

    s = """
    Date  Range  Label  Col1  Col2
    '2018-08-01'  1  1  900   815
    '2018-08-01'  1  2  950   820
    '2018-08-01'  1  3  900   820
    '2018-08-01'  1  4  950   840
    '2018-08-01'  2  1  900   820
    '2018-08-01'  2  2  750   850
    '2018-08-01'  2  3  850   820
    '2018-08-01'  2  4  850   800
    '2018-08-02'  1  1  900   815
    '2018-08-02'  1  2  850   820
    '2018-08-02'  1  3  800   820
    '2018-08-02'  1  4  950   840
    '2018-08-02'  2  1  900   820
    '2018-08-02'  2  2  750   850
    '2018-08-02'  2  3  850   820
    '2018-08-02'  2  4  850   800
    """

    df = pd.read_csv(StringIO(s), sep='\s+', index_col=['Date', 'Range', 'Label'])

    groupby_date_range = df.groupby(['Date', 'Range'])
    df['cumcount'] = groupby_date_range.cumcount()

    # first position (per group) at which Col1 < Col2; the defaultdict falls
    # back to len(df), so groups where Col1 is never smaller keep all rows
    first_col1_lt_col2 = defaultdict(
        lambda: len(df),
        df[df['Col1'] < df['Col2']].groupby(['Date', 'Range'])['cumcount'].min().to_dict()
    )

    result = df[df.apply(lambda row: row['cumcount'] < first_col1_lt_col2[row.name[:2]], axis=1)]
    result = result.drop(columns='cumcount')
    print(result)
                          Col1  Col2
Date         Range Label            
'2018-08-01' 1     1       900   815
                   2       950   820
                   3       900   820
                   4       950   840
             2     1       900   820
'2018-08-02' 1     1       900   815
                   2       850   820
             2     1       900   820
Another approach would be to use np.where and select the first Label per group.
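Sketching what that comment seems to intend (an assumption, not code from the thread): `np.argmax` on the boolean mask gives the position of the first `True`, so each group can be truncated with `iloc`:

```python
import numpy as np
import pandas as pd
from io import StringIO

s = """
Date  Range  Label  Col1  Col2
'2018-08-01'  1  1  900   815
'2018-08-01'  1  2  950   820
'2018-08-01'  1  3  900   820
'2018-08-01'  1  4  950   840
'2018-08-01'  2  1  900   820
'2018-08-01'  2  2  750   850
'2018-08-01'  2  3  850   820
'2018-08-01'  2  4  850   800
'2018-08-02'  1  1  900   815
'2018-08-02'  1  2  850   820
'2018-08-02'  1  3  800   820
'2018-08-02'  1  4  950   840
'2018-08-02'  2  1  900   820
'2018-08-02'  2  2  750   850
'2018-08-02'  2  3  850   820
'2018-08-02'  2  4  850   800
"""

df = pd.read_csv(StringIO(s), sep=r'\s+', index_col=['Date', 'Range', 'Label'])

def cut(g):
    # position of the first row with Col1 < Col2; keep the whole
    # group when the condition never holds
    mask = (g['Col1'] < g['Col2']).to_numpy()
    stop = int(np.argmax(mask)) if mask.any() else len(g)
    return g.iloc[:stop]

result = df.groupby(level=['Date', 'Range'], group_keys=False).apply(cut)
```

`group_keys=False` keeps the original MultiIndex instead of prepending the group keys again.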

as_index=False in groupby gives you the chance to ignore the index columns in the groupby. Have a look at this:

Code:

First, we create a helper column that numbers the rows in each group. Then, for each group, we find the first row where Col1 < Col2 and keep only the rows before it:
df2['cumcount'] = df2.groupby(level=[0, 1]).cumcount()

dfs = []

for idx, d in df2.groupby(level=[0, 1]):
    first_bad = d.loc[d['Col1'] < d['Col2'], 'cumcount'].min()
    # min() is NaN when Col1 is never smaller than Col2: keep the whole group
    n = len(d) if pd.isna(first_bad) else first_bad - 1
    dfs.append(d.loc[d['cumcount'].le(n)])

df_final = pd.concat(dfs).drop('cumcount', axis=1)

You can do it as follows:

# create a dataframe with a similar structure as yours
data={
'Date': ['2019-04-08', '2019-06-27', '2019-04-05', '2019-05-01', '2019-04-09', '2019-06-19', '2019-04-25', '2019-05-18', '2019-06-10', '2019-05-19', '2019-07-01', '2019-04-07', '2019-03-31', '2019-04-01', '2019-06-09', '2019-04-17', '2019-04-27', '2019-05-27', '2019-06-29', '2019-04-24'],
'Key1': ['B', 'B', 'C', 'A', 'C', 'B', 'A', 'C', 'A', 'C', 'A', 'A', 'C', 'A', 'A', 'B', 'B', 'B', 'A', 'A'],
'Col1': [670, 860, 658, 685, 628, 826, 871, 510, 707, 775, 707, 576, 800, 556, 833, 551, 591, 492, 647, 414],
'Col2': [442, 451, 383, 201, 424, 342, 315, 548, 321, 279, 379, 246, 269, 461, 461, 371, 342, 327, 226, 467],
}

df = pd.DataFrame(data)
df.sort_values(['Date', 'Key1'], ascending=True, inplace=True)
df.set_index(['Date', 'Key1'], inplace=True)

# here the real work starts
# temporarily create a dataframe with the comparison,
# which has a simple numeric index to be used later
# to slice the original dataframe
df2 = (df['Col1'] < df['Col2']).reset_index()

# we only want to see the rows from the first row
# up to the last row before a row in which Col1 < Col2
all_unwanted = df2.loc[df2[0] == True, [0]]
if len(all_unwanted) > 0:
    # good, there was such a row, so we can use its index
    # to slice our dataframe
    show_up_to = all_unwanted.idxmin()[0]
else:
    # no, there was no such row, so just display everything
    show_up_to = len(df)
# use the row number to slice our dataframe
df.iloc[0:show_up_to]
The output is:

                 Col1  Col2
Date       Key1            
2019-03-31 C      800   269
2019-04-01 A      556   461
2019-04-05 C      658   383
2019-04-07 A      576   246
2019-04-08 B      670   442
2019-04-09 C      628   424
2019-04-17 B      551   371
--------------------------- <-- cutting off the following lines:
2019-04-24 A      414   467
2019-04-25 A      871   315
2019-04-27 B      591   342
2019-05-01 A      685   201
2019-05-18 C      510   548
2019-05-19 C      775   279
2019-05-27 B      492   327
2019-06-09 A      833   461
2019-06-10 A      707   321
2019-06-19 B      826   342
2019-06-27 B      860   451
2019-06-29 A      647   226
2019-07-01 A      707   379

By the way, if anyone knows how to use the result of .min directly and avoid the dict and the apply afterwards, improvement suggestions are very welcome. — Aren't those columns part of the index? — I updated my question; the problem with your solution is that it cannot handle the case where Col1 is always greater than Col2. My mistake for not specifying it in the first place. If you can find a fix, I will gladly accept your answer. — @baccandr OK, great. I have edited my answer: the dict is now a defaultdict defaulting to len(df), so if Col1 is never smaller than Col2 the whole group is kept:

                          Col1  Col2
Date         Range Label            
'2018-08-01' 1     A       900   815
                   B       850   820
             2     A       900   820
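On the open question of using the .min() result directly and dropping the dict and the row-wise apply: one possible vectorized variant (a sketch, not the accepted code) broadcasts the per-group first violation back onto every row with transform; groups with no violation get NaN, which compares as False and therefore keeps those groups whole:

```python
import pandas as pd
from io import StringIO

s = """
Date  Range  Label  Col1  Col2
'2018-08-01'  1  1  900   815
'2018-08-01'  1  2  950   820
'2018-08-01'  1  3  900   820
'2018-08-01'  1  4  950   840
'2018-08-01'  2  1  900   820
'2018-08-01'  2  2  750   850
'2018-08-01'  2  3  850   820
'2018-08-01'  2  4  850   800
'2018-08-02'  1  1  900   815
'2018-08-02'  1  2  850   820
'2018-08-02'  1  3  800   820
'2018-08-02'  1  4  950   840
'2018-08-02'  2  1  900   820
'2018-08-02'  2  2  750   850
'2018-08-02'  2  3  850   820
'2018-08-02'  2  4  850   800
"""

df = pd.read_csv(StringIO(s), sep=r'\s+', index_col=['Date', 'Range', 'Label'])

cc = df.groupby(level=['Date', 'Range']).cumcount()
cond = df['Col1'] < df['Col2']

# per-group position of the first violation, broadcast to every row;
# groups without a violation get NaN
first_bad = cc.where(cond).groupby(level=['Date', 'Range']).transform('min')

# cc >= NaN evaluates to False, so violation-free groups survive intact
result = df[~(cc >= first_bad)]
```

No intermediate dict and no `axis=1` apply, at the cost of a helper Series per step.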