Performing an operation based on a condition in a Python DataFrame

python regex pandas if-statement

I have the following DataFrame:

import pandas as pd

data = {'example': ['ACETATO MOLOCUATO',
  'WORD1 WORD2 WORD3 WORD4_ATO',
  'WORD1 WORD2 WORD3 WORD4']}

df = pd.DataFrame(data)

print(df)

                       example
0            ACETATO MOLOCUATO
1  WORD1 WORD2 WORD3 WORD4_ATO
2      WORD1 WORD2 WORD3 WORD4

The condition is that if the string ends with "ATO", I want to select only the first three words. My approach so far is as follows:

# if example ends with "ATO"

# then extract the first 3 words

df['example'].str.extract(r'(\w+)\s(\w+)\s(\w+)')

# else, leave the DataFrame unchanged

I would like to know the best way to tie the last operation to the first condition.

Input data:

>>> df
                      example
0  W1 W2 W3 ACETATO MOLOCUATO
1          NOTHING TO DO HERE

Filter and apply:

mask = df["example"].str.endswith("ATO")  # condition
df.loc[mask, "example"] = df.loc[mask, "example"] \
                            .apply(lambda s:  " ".join(s.split(" ")[:3]))

Output:

>>> df
              example
0            W1 W2 W3
1  NOTHING TO DO HERE
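The same filter-and-truncate step can also be sketched with pandas string accessors instead of `apply`. This is only an illustration, assuming words are separated by whitespace; the helper name `keep_first_three_if_ato` is hypothetical and not part of the answer above.

```python
import pandas as pd


def keep_first_three_if_ato(df: pd.DataFrame, column: str = "example") -> pd.DataFrame:
    """Return a copy where rows ending in 'ATO' keep only their first three words."""
    out = df.copy()
    # condition: the string ends with "ATO" (missing values count as False)
    mask = out[column].str.endswith("ATO", na=False)
    # split on whitespace, keep at most three tokens, and join them back
    out.loc[mask, column] = out.loc[mask, column].str.split().str[:3].str.join(" ")
    return out


df = pd.DataFrame({"example": ["W1 W2 W3 ACETATO MOLOCUATO", "NOTHING TO DO HERE"]})
result = keep_first_three_if_ato(df)
print(result)
```

Working on a copy leaves the original `df` untouched, which makes it easy to compare the rows before and after.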

One approach is as follows:

df = df.join(df.loc[df['example'].str.endswith('ATO'), 'example'].str.extract(r'(?P<Word1>\w+)(?:\s+(?P<Word2>\w+))?(?:\s+(?P<Word3>\w+))?'))

The regex used:

(?P<Word1>\w+)(?:\s+(?P<Word2>\w+))?(?:\s+(?P<Word3>\w+))?

What the regex does:

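It allows extracting 1, 2, or 3 words (all three do not have to be present). Note that the pattern in the question, (\w+)\s(\w+)\s(\w+), returns no match when the string has only 1 or 2 words.

Named capture groups are used to set the names of the 3 new columns.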
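The difference between the two patterns can be checked with Python's standard `re` module. The sketch below is only an illustration; the variable names `STRICT` and `OPTIONAL` are made up for this example.

```python
import re

# Pattern from the question: requires exactly three whitespace-separated words
STRICT = re.compile(r"(\w+)\s(\w+)\s(\w+)")
# Pattern from this answer: the second and third words are optional
OPTIONAL = re.compile(r"(?P<Word1>\w+)(?:\s+(?P<Word2>\w+))?(?:\s+(?P<Word3>\w+))?")

text = "ACETATO MOLOCUATO"  # only two words
strict_match = STRICT.search(text)
optional_match = OPTIONAL.search(text)

print(strict_match)                # None: there is no third word to match
print(optional_match.groupdict())  # Word3 is None because the group did not participate
```

In pandas, a group that did not participate in the match shows up as NaN in the extracted column.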

Demo

Data setup:

import pandas as pd

data = {'example': ['ACETATO MOLOCUATO',
  'WORD1 WORD2 WORD3 WORD4_ATO',
  'WORD1 WORD2 WORD3 WORD4']}

df = pd.DataFrame(data)

print(df)

                       example
0            ACETATO MOLOCUATO
1  WORD1 WORD2 WORD3 WORD4_ATO
2      WORD1 WORD2 WORD3 WORD4

Run the new code:

df = df.join(df.loc[df['example'].str.endswith('ATO'), 'example'].str.extract(r'(?P<Word1>\w+)(?:\s+(?P<Word2>\w+))?(?:\s+(?P<Word3>\w+))?'))

print(df)


                       example    Word1      Word2  Word3
0            ACETATO MOLOCUATO  ACETATO  MOLOCUATO    NaN
1  WORD1 WORD2 WORD3 WORD4_ATO    WORD1      WORD2  WORD3
2      WORD1 WORD2 WORD3 WORD4      NaN        NaN    NaN

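If you want the result in a single column, you can combine the extracted words into one column (called 3_words here).

Result: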

print(df)


                       example    Word1      Word2  Word3            3_words
0            ACETATO MOLOCUATO  ACETATO  MOLOCUATO    NaN  ACETATO MOLOCUATO
1  WORD1 WORD2 WORD3 WORD4_ATO    WORD1      WORD2  WORD3  WORD1 WORD2 WORD3
2      WORD1 WORD2 WORD3 WORD4      NaN        NaN    NaN
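The code that built the 3_words column is not shown above. One possible way to obtain such a column, offered only as a sketch and not necessarily the approach originally used, is to join the non-missing values of Word1, Word2 and Word3 row by row. It assumes that rows without a match should end up as an empty string.

```python
import pandas as pd

df = pd.DataFrame({'example': ['ACETATO MOLOCUATO',
                               'WORD1 WORD2 WORD3 WORD4_ATO',
                               'WORD1 WORD2 WORD3 WORD4']})

pattern = r'(?P<Word1>\w+)(?:\s+(?P<Word2>\w+))?(?:\s+(?P<Word3>\w+))?'

# Extract up to three words, but only for rows ending in "ATO"
df = df.join(df.loc[df['example'].str.endswith('ATO'), 'example'].str.extract(pattern))

# Join the non-missing words of each row with a single space;
# rows with no extracted words become an empty string (an assumption).
df['3_words'] = df[['Word1', 'Word2', 'Word3']].apply(
    lambda row: ' '.join(row.dropna()), axis=1)

print(df)
```

If missing values are preferred over empty strings for non-matching rows, the empty strings can be replaced afterwards.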

You can use a regex with optional capturing groups, but with only a single call to Series.str.extract:

import pandas as pd
df = pd.DataFrame({'example': ['ACETATO MOLOCUATO', 'WORD1 WORD2 WORD3 WORD4_ATO', 'WORD1 WORD2 WORD3 WORD4']})
df[['Word_1', 'Word_2', 'Word_3']] = df['example'].str.extract(r'^(?=.*ATO$)(\w+)(?:\s+(\w+))?(?:\s+(\w+))?')
# >>> df
#                        example   Word_1     Word_2  Word_3
# 0            ACETATO MOLOCUATO  ACETATO  MOLOCUATO     NaN
# 1  WORD1 WORD2 WORD3 WORD4_ATO    WORD1      WORD2   WORD3
# 2      WORD1 WORD2 WORD3 WORD4      NaN        NaN     NaN

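Details:

- `^` - start of the string
- `(?=.*ATO$)` - a positive lookahead that requires the string to end with `ATO`
- `(\w+)` - Group 1: one or more word characters
- `(?:\s+(\w+))?` - an optional non-capturing group that matches one or more whitespace characters and then captures one or more word characters into Group 2
- `(?:\s+(\w+))?` - an optional non-capturing group that matches one or more whitespace characters and then captures one or more word characters into Group 3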
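To see the lookahead at work outside pandas, the pattern described above can be tried with Python's standard `re` module. This is a small sketch; the names `PATTERN`, `samples` and `results` are chosen for this example only.

```python
import re

# Lookahead enforces the "ends with ATO" condition before any word is captured
PATTERN = re.compile(r'^(?=.*ATO$)(\w+)(?:\s+(\w+))?(?:\s+(\w+))?')

samples = ['ACETATO MOLOCUATO',
           'WORD1 WORD2 WORD3 WORD4_ATO',
           'WORD1 WORD2 WORD3 WORD4']

results = []
for s in samples:
    m = PATTERN.search(s)
    # groups() returns None for optional groups that did not match
    results.append(m.groups() if m else None)

print(results)
```

The third sample does not end with `ATO`, so the lookahead fails at the start of the string and the whole match is rejected, which is why no separate `endswith` filter is needed.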

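In what format do you need the 3 extracted words? In separate columns, or all in one column (for example, as a list of words or separated by commas)? Please show your expected output. Also, please revise your sample data so that it includes a case with 3 or more words.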