Does Python have a function to select the first two words after a specified string?


I am trying to select the first two words after the string "POS PURCHASE" in my dataset.

This is my dataset:

df:   
    ID        transaction_description
     1         POS PURCHASE MR PRICE WHK FAC
     2         WITHDRAWAL FEE
     3         POS PURCHASE KFC WERNHIL STATE
     4         REJECTED ATM TRANSACTION
     5         ATM CASH WITHDRAWAL
     6         POS PURCHASE EDGARS GROVE
I would like my output to look like this:

dfnew:
    ID        transaction_description            TRANX
     1         POS PURCHASE MR PRICE WHK FAC      MR PRICE
     2         WITHDRAWAL FEE                     WITHDRAWAL FEE
     3         POS PURCHASE KFC WERNHIL STATE     KFC WERNHIL
     4         REJECTED ATM TRANSACTION           REJECTED ATM TRANSACTION
     5         ATM CASH WITHDRAWAL                ATM CASH WITHDRAWAL
     6         POS PURCHASE EDGARS GROVE MALL     EDGARS GROVE
I tried the code below, but I could not create a new column containing the desired output.

code:

   for value in df['transaction_description'].values:
       non_data = re.split('POS PURCHASE |POS PURCHASE ',value)
       terms_list = [term for term in non_data if len(term) > 0] 
       substrs = [term.split()[0:1] for term in terms_list] 
       result = [' '.join(term) for term in substrs] 
   print (result)

Here is one approach using regex.

Ex:

import re

import pandas as pd

# Sample data from the question
df = pd.DataFrame({"transaction_description": ['POS PURCHASE MR PRICE WHK FAC', 'WITHDRAWAL FEE', 'POS PURCHASE KFC WERNHIL STATE', 'REJECTED ATM TRANSACTION', 'ATM CASH WITHDRAWAL', 'POS PURCHASE EDGARS GROVE']})
# Capture the two words after "POS PURCHASE"; all other rows keep their full text
df["TRANX"] = df["transaction_description"].apply(lambda x: re.search(r"POS PURCHASE (\w+\s+\w+)", x).group(1) if "POS PURCHASE" in x else x)
print(df)
Output:

          transaction_description                     TRANX
0   POS PURCHASE MR PRICE WHK FAC                  MR PRICE
1                  WITHDRAWAL FEE            WITHDRAWAL FEE
2  POS PURCHASE KFC WERNHIL STATE               KFC WERNHIL
3        REJECTED ATM TRANSACTION  REJECTED ATM TRANSACTION
4             ATM CASH WITHDRAWAL       ATM CASH WITHDRAWAL
5       POS PURCHASE EDGARS GROVE              EDGARS GROVE

Edit: using str.extract

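import pandas as pd

df = pd.DataFrame({"transaction_description": ['POS PURCHASE MR PRICE WHK FAC', 'WITHDRAWAL FEE', 'POS PURCHASE KFC WERNHIL STATE', 'REJECTED ATM TRANSACTION', 'ATM CASH WITHDRAWAL', 'POS PURCHASE EDGARS GROVE']})
# expand=False returns a Series; rows without a match come back as NaN
df["TRANX"] = df["transaction_description"].str.extract(r"POS PURCHASE (\w+\s+\w+)", expand=False)
# Fall back to the full description for those rows (assign the result instead of using inplace=True on a column)
df["TRANX"] = df["TRANX"].fillna(df["transaction_description"])
print(df)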

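If POS PURCHASE always appears at the start, as it does in the sample data, you can simply remove it. Note that this keeps everything after the prefix, not only the first two words.

df['TRANX'] = df['transaction_description'].str.replace('POS PURCHASE ', '', regex=False)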
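
Since removing the prefix alone leaves every remaining word in place, the following is a minimal sketch of one way to also trim the result to two words. It assumes the prefix is always at the start of the string; the sample rows, the `prefix` and `is_pos` names, and the bare "POS PURCHASE" row are illustrative and not taken from the original data.

```python
import pandas as pd

# Illustrative sample; the last row is an assumed edge case with nothing after the prefix.
df = pd.DataFrame({"transaction_description": [
    "POS PURCHASE MR PRICE WHK FAC",
    "WITHDRAWAL FEE",
    "POS PURCHASE KFC WERNHIL STATE",
    "POS PURCHASE",
]})

prefix = "POS PURCHASE"
desc = df["transaction_description"]
is_pos = desc.str.startswith(prefix)

# Cut off the prefix, split on whitespace, keep at most two words, and rejoin them.
first_two = desc.str[len(prefix):].apply(lambda s: " ".join(s.split()[:2]))

# Keep the original text for non-POS rows and for POS rows with nothing after the prefix.
df["TRANX"] = desc.where(~is_pos | (first_two == ""), first_two)
print(df)
```

Rows that do not start with the prefix are left unchanged, so the fallback behaviour matches the regex answers above.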

I get an error when I run the code: "AttributeError: 'NoneType' object has no attribute 'group'". It looks like some rows have no data after POS PURCHASE.
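
That error is what `re.search` produces when a row contains "POS PURCHASE" but fewer than two words follow it: the search returns `None`, and `.group(1)` then fails. Below is a hedged sketch of a guarded version. The helper name `first_two_words` and the sample rows are hypothetical, and the pattern uses `\S+` with an optional second word, which is an assumption about how short rows should be handled.

```python
import re

import pandas as pd

# One word after the prefix is required; a second word is optional.
PATTERN = re.compile(r"POS PURCHASE\s+(\S+(?:\s+\S+)?)")


def first_two_words(description):
    """Return up to two words after 'POS PURCHASE', or the input unchanged."""
    match = PATTERN.search(description)
    # Check the match object before calling .group() to avoid the AttributeError.
    return match.group(1) if match else description


# Hypothetical rows, including ones with one word or nothing after the prefix.
df = pd.DataFrame({"transaction_description": [
    "POS PURCHASE MR PRICE WHK FAC",
    "POS PURCHASE EDGARS",
    "POS PURCHASE",
    "WITHDRAWAL FEE",
]})
df["TRANX"] = df["transaction_description"].apply(first_two_words)
print(df)
```

Whether a row with only one trailing word should return that word or the full description is a design choice; this sketch returns the single word.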