Python: filter a df containing specific values within a sequence - 1


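I have a fairly involved process for subsetting a df. For a given sequence of rows, I want to return specific values. Specifically, using the df below, the start of each sequence is marked by Start_A, Start_B or Start_C. If X or Y occurs within a given sequence, I want to return Up or Down for that same sequence. If Up or Down is not found in a sequence that contains X or Y, return Left or Right instead. If neither Up or Down nor X or Y is found, print an error.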

import pandas as pd

df = pd.DataFrame({      
    'Num' : [1,2,3,4,6,7,9,10,12,13,14,15,17,18,19,21,22,23,25,26,27,28,30,31,32],
    'Item' : ['Start_A','AB','CD','Left','Start_C','CD','X','Up','Right','Start_C','EF','AB','Y','AB','Down','Left','Start_B','AB','Y','CD','Left','Start_A','AB','CD','Right'],        
    })
  
m1 = df['Item'].isin(['X','Y']).cumsum().gt(0)
m2 = df['Item'].isin(['Up','Down']).iloc[::-1].cumsum().gt(0)

df1 = df[m1 & m2]

Original df:

    Num     Item
0     1  Start_A # No X,Y within sequence. drop all
1     2       AB
2     3       CD
3     4     Left
4     6  Start_C # X and Up within sequence.
5     7       CD
6     9        X
7    10       Up
8    12    Right
9    13  Start_C # Y and Down within sequence.
10   14       EF
11   15       AB
12   17        Y
13   18       AB
14   19     Down
15   21     Left
16   22  Start_B # Y within sequence. No Up/Down. But Left is.
17   23       AB
18   25        Y
19   26       CD
20   27     Left
21   28  Start_A # No X,Y within sequence. drop all
22   30       AB
23   31       CD
24   32    Right

Expected output:

    Num     Item
6     9        X
7    10       Up
12   17        Y
14   19     Down
18   25        Y
20   27     Left

Here is one approach:

import pandas as pd

df = pd.DataFrame({      
    'Num' : [1,2,3,4,6,7,9,10,12,13,14,15,17,18,19,21,22,23,25,26,27,28,30,31,32],
    'Item' : ['Start_A','AB','CD','Left','Start_C','CD','X','Up','Right','Start_C','EF','AB','Y','AB','Down','Left','Start_B','AB','Y','CD','Left','Start_A','AB','CD','Right'],        
    })
  

grp = df['Item'].str.startswith('Start_').cumsum()

df['X_Y'] = df['Item'].isin(['X', 'Y'])
df['Up_Down'] = df['Item'].isin(['Up', 'Down'])
df['Left_Right'] = df['Item'].isin(['Left', 'Right'])

def f(x):
    if x['X_Y'].any():
        return pd.concat([x[x['X_Y']], x[x['Up_Down']], x[x['Left_Right']]]).head(2)

df.groupby(grp, group_keys=False).apply(f).drop(['X_Y', 'Up_Down', 'Left_Right'], axis=1)

Output:

    Num  Item
6     9     X
7    10    Up
12   17     Y
14   19  Down
18   25     Y
20   27  Left

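Details:

First, create the grouping key grp using cumsum together with startswith('Start_').
Next, create three boolean columns, X_Y, Up_Down and Left_Right, marking the rows of each kind.
Then, create a custom function that takes each group; if the group has a True record in X_Y, it builds a dataframe by concatenating the X_Y, Up_Down and Left_Right rows in that order, and head(2) keeps only the first two records of each group.
Finally, drop the helper columns from the dataframe produced by groupby.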
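
The grouping step can be inspected on its own. Below is a minimal sketch on a shortened, made-up Item series (the sample values are an assumption for illustration, not the df from the question). Each Start_ row bumps the running total, so every row of a sequence ends up with the same group label.

```python
import pandas as pd

# Shortened, hypothetical sample; not the df from the question
items = pd.Series(['Start_A', 'AB', 'Left', 'Start_C', 'X', 'Up', 'Start_B', 'Y'])

# True on each row that opens a new sequence
is_start = items.str.startswith('Start_')

# Running total of the start flags: rows in the same sequence share one label
grp = is_start.cumsum()

print(grp.tolist())
```

Passing a label series like this to groupby is what lets the custom function see one sequence at a time.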

Comments are added in the code; hopefully I have understood your challenge correctly:

# thanks to Scott Boston for a simpler syntax here
(df.assign(counter = df.Item.str.startswith("Start_").cumsum(),
           # join the items of each sequence into one comma-separated string
           boolean = lambda df: df.groupby('counter').Item.transform(",".join),
           # first phase, X or Y should be present
           # if absent, nulls will be introduced
           boolean_1 = lambda df: df.boolean.str.extract(r"(X|Y)", expand=False)
           )
   .dropna()
    # next phase, look for Up, Down, Left or Right
    # extract returns the first match in the joined string, so this
    # relies on Up/Down appearing before Left/Right within a sequence
   .assign(boolean_2 = lambda df: df.boolean
                                    .str.extract(r"(Up|Down|Left|Right)", expand=False))
    # filter and keep the original columns
   .query("Item == boolean_1 or Item == boolean_2")
   .filter(['Num', 'Item'])
)


    Num  Item
6     9     X
7    10    Up
12   17     Y
14   19  Down
18   25     Y
20   27  Left
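
The question also asks for an error when a sequence contains X or Y but none of Up, Down, Left or Right. A minimal sketch of one way to cover that case follows, with Up/Down explicitly preferred over Left/Right regardless of row order. The helper name pick_rows and the wording of the message are assumptions for illustration, not part of either answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Num': [1, 2, 3, 4, 6, 7, 9, 10, 12, 13, 14, 15, 17, 18, 19,
            21, 22, 23, 25, 26, 27, 28, 30, 31, 32],
    'Item': ['Start_A', 'AB', 'CD', 'Left', 'Start_C', 'CD', 'X', 'Up', 'Right',
             'Start_C', 'EF', 'AB', 'Y', 'AB', 'Down', 'Left', 'Start_B', 'AB',
             'Y', 'CD', 'Left', 'Start_A', 'AB', 'CD', 'Right'],
})


def pick_rows(seq):
    """Hypothetical helper: X/Y row plus one direction row, or None."""
    xy = seq[seq['Item'].isin(['X', 'Y'])]
    if xy.empty:
        return None  # sequences without X or Y are dropped
    up_down = seq[seq['Item'].isin(['Up', 'Down'])]
    left_right = seq[seq['Item'].isin(['Left', 'Right'])]
    # Prefer Up/Down; fall back to Left/Right only when neither is present
    direction = left_right if up_down.empty else up_down
    if direction.empty:
        first_num = seq['Num'].iloc[0]
        print(f"Error: no Up/Down/Left/Right in sequence starting at Num {first_num}")
        return None
    return pd.concat([xy.head(1), direction.head(1)])


# One label per sequence; renamed so it cannot be confused with the Item column
grp = df['Item'].str.startswith('Start_').cumsum().rename('seq')

parts = [pick_rows(seq) for _, seq in df.groupby(grp)]
result = pd.concat([p for p in parts if p is not None]).sort_index()
print(result)
```

On the sample df this selects the same six rows as the outputs shown above and prints no error, since every sequence with X or Y also has a direction row.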
