Conditionally performing an operation on a pandas DataFrame


I have the following DataFrame:

import pandas as pd

data = {'example': ['ACETATO MOLOCUATO',
  'WORD1 WORD2 WORD3 WORD4_ATO',
  'WORD1 WORD2 WORD3 WORD4']}

df = pd.DataFrame(data)

print(df)

                       example
0            ACETATO MOLOCUATO
1  WORD1 WORD2 WORD3 WORD4_ATO
2      WORD1 WORD2 WORD3 WORD4

The condition is: if the string ends with "ATO", I want to keep only the first three words. My plan is as follows:

# If example ends with "ATO"
# then extract the first 3 words
df['example'].str.extract(r'(\w+)\s(\w+)\s(\w+)')
# Otherwise, leave the DataFrame unchanged

I would like to know the best way to tie the last operation to the first condition.

Input data:

>>> df
                      example
0  W1 W2 W3 ACETATO MOLOCUATO
1          NOTHING TO DO HERE

Filter and apply:

mask = df["example"].str.endswith("ATO")  # condition
# Keep only the first three space-separated words in the matching rows
df.loc[mask, "example"] = df.loc[mask, "example"] \
                            .apply(lambda s: " ".join(s.split(" ")[:3]))

Output:

>>> df
              example
0            W1 W2 W3
1  NOTHING TO DO HERE
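
The same filter-and-apply step can also be written without apply, using the vectorized string accessor. This is only a sketch of an alternative, not part of the answer above; note that str.split() with no argument splits on any run of whitespace, which differs slightly from split(" ").

```python
import pandas as pd

df = pd.DataFrame({"example": ["W1 W2 W3 ACETATO MOLOCUATO",
                               "NOTHING TO DO HERE"]})

mask = df["example"].str.endswith("ATO")  # condition

# Split into word lists, keep the first three words, join them back
df.loc[mask, "example"] = (
    df.loc[mask, "example"].str.split().str[:3].str.join(" ")
)
```

Rows where the mask is False are never touched, so they keep their original text.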

One approach is as follows:

df = df.join(df.loc[df['example'].str.endswith('ATO'), 'example'].str.extract(r'(?P<Word1>\w+)(?:\s+(?P<Word2>\w+))?(?:\s+(?P<Word3>\w+))?'))

Features of the regex (?P<Word1>\w+)(?:\s+(?P<Word2>\w+))?(?:\s+(?P<Word3>\w+))?:

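  • It allows extracting 1, 2, or 3 words (all 3 need not be present). Note that with the regex in the question, a string containing only 1 or 2 words returns no match.
  • Named capture groups are used to name the 3 new columns.

Demo

Data setup: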

    data = {'example': ['ACETATO MOLOCUATO',
      'WORD1 WORD2 WORD3 WORD4_ATO',
      'WORD1 WORD2 WORD3 WORD4']}
    
    df = pd.DataFrame(data)
    
    print(df)
    
                           example
    0            ACETATO MOLOCUATO
    1  WORD1 WORD2 WORD3 WORD4_ATO
    2      WORD1 WORD2 WORD3 WORD4
    
Run the new code:

    df = df.join(df.loc[df ['example'].str.endswith('ATO'), 'example'].str.extract(r'(?P<Word1>\w+)(?:\s+(?P<Word2>\w+))?(?:\s+(?P<Word3>\w+))?'))
    
    print(df)
    
    
                           example    Word1      Word2  Word3
    0            ACETATO MOLOCUATO  ACETATO  MOLOCUATO    NaN
    1  WORD1 WORD2 WORD3 WORD4_ATO    WORD1      WORD2  WORD3
    2      WORD1 WORD2 WORD3 WORD4      NaN        NaN    NaN
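
To see why the optional groups matter, the two patterns can be compared on a two-word string with the standard re module. This is a small illustrative sketch; the variable names strict and optional are made up for the example.

```python
import re

# Pattern from the question: requires exactly word, space, word, space, word
strict = re.compile(r'(\w+)\s(\w+)\s(\w+)')

# Pattern from this answer: the second and third words are optional
optional = re.compile(
    r'(?P<Word1>\w+)(?:\s+(?P<Word2>\w+))?(?:\s+(?P<Word3>\w+))?'
)

two_words = 'ACETATO MOLOCUATO'

strict_match = strict.search(two_words)         # no match: only two words
optional_groups = optional.search(two_words).groupdict()
```

With pandas, an unmatched optional group shows up as NaN in the corresponding column, which is what the Word3 column displays for row 0.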
    
    
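If you want the result in a single column, you can join the three word columns into one new column (here named 3_words). Result: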

    print(df)
    
    
                           example    Word1      Word2  Word3            3_words
    0            ACETATO MOLOCUATO  ACETATO  MOLOCUATO    NaN  ACETATO MOLOCUATO
    1  WORD1 WORD2 WORD3 WORD4_ATO    WORD1      WORD2  WORD3  WORD1 WORD2 WORD3
    2      WORD1 WORD2 WORD3 WORD4      NaN        NaN    NaN                   
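
The statement that builds the 3_words column is not shown above. One possible way to produce that result, assuming the Word1, Word2, and Word3 columns created by the join step, is sketched below; it is a guess at a workable approach, not necessarily the answerer's original code.

```python
import pandas as pd

df = pd.DataFrame({'example': ['ACETATO MOLOCUATO',
                               'WORD1 WORD2 WORD3 WORD4_ATO',
                               'WORD1 WORD2 WORD3 WORD4']})

pattern = r'(?P<Word1>\w+)(?:\s+(?P<Word2>\w+))?(?:\s+(?P<Word3>\w+))?'
df = df.join(
    df.loc[df['example'].str.endswith('ATO'), 'example'].str.extract(pattern)
)

# Join the non-missing words of each row with a space;
# a row with no extracted words becomes an empty string
df['3_words'] = df[['Word1', 'Word2', 'Word3']].apply(
    lambda row: ' '.join(row.dropna()), axis=1
)
```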
    
    
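You can use a regex with optional capture groups, but with just a single call to Series.str.extract:

    import pandas as pd
    df = pd.DataFrame({'example': ['ACETATO MOLOCUATO', 'WORD1 WORD2 WORD3 WORD4_ATO', 'WORD1 WORD2 WORD3 WORD4']})
    # The lookahead enforces the "ends with ATO" condition inside the pattern itself
    df[['Word_1', 'Word_2', 'Word_3']] = df['example'].str.extract(r'^(?=.*ATO$)(\w+)(?:\s+(\w+))?(?:\s+(\w+))?')
    # >>> df
    #                        example   Word_1     Word_2  Word_3
    # 0            ACETATO MOLOCUATO  ACETATO  MOLOCUATO     NaN
    # 1  WORD1 WORD2 WORD3 WORD4_ATO    WORD1      WORD2   WORD3
    # 2      WORD1 WORD2 WORD3 WORD4      NaN        NaN     NaN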

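Details:

    • ^ - start of the string
    • (?=.*ATO$) - a positive lookahead requiring that the string end with ATO
    • (\w+) - Group 1: one or more word characters
    • (?:\s+(\w+))? - an optional non-capturing group that matches one or more whitespace characters and then captures one or more word characters into Group 2
    • (?:\s+(\w+))? - an optional non-capturing group that matches one or more whitespace characters and then captures one or more word characters into Group 3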
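
The behavior described in the list above can be checked on single strings with the standard re module. This is a minimal sketch; first_three is a hypothetical helper introduced only for illustration.

```python
import re

# Same pattern as in the str.extract call: the lookahead checks the "ATO"
# ending, then up to three words are captured from the start of the string
pattern = re.compile(r'^(?=.*ATO$)(\w+)(?:\s+(\w+))?(?:\s+(\w+))?')

def first_three(text):
    """Return the captured words as a tuple, or None if text does not end with ATO."""
    match = pattern.search(text)
    return match.groups() if match else None

with_ato = first_three('WORD1 WORD2 WORD3 WORD4_ATO')
without_ato = first_three('WORD1 WORD2 WORD3 WORD4')
short_ato = first_three('ACETATO MOLOCUATO')
```

Because the pattern is anchored with ^, a failed lookahead at the start of the string means the whole search fails, which is why strings that do not end with ATO produce no match at all.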

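In what format do you need the 3 words extracted? In separate columns, or all in one column (for example, as a list of words or separated by commas)? Please state your expected output. Also, please amend your sample data to include cases with 3 or more words.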