Pandas regex: find an exact word and the word before it in a string column, and append the matches as new columns

regex python-3.x pandas

Find the target word and the word before it in column a, and append the matched strings in columns b and c.

I tried the code below to achieve this, but I am not getting the expected output. Any help is appreciated.
Here is my approach with regular expressions:

df['col_b_PY'] = df.col_a.str.contains(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,1}PY")

df.col_a.str.extract(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,1}PY", expand=True)

col_a

Python PY is a general-purpose language LG

Programming language LG in Python PY 

Its easier LG to understand  PY

The syntax of the language LG is clean PY

The DataFrame looks like the one shown above.

Desired output:

col_a                                        col_b_PY       col_c_LG
Python PY is a general-purpose language LG   Python PY      language LG
Programming language LG in Python PY         Python PY      language LG
Its easier LG to understand PY               understand PY  easier LG
The syntax of the language LG is clean PY    clean PY       language LG

Check this:

df['col_c_LG'], df['col_c_PY'] = df['col_a'].str.extract(r"(\w+\s+LG)"), df['col_a'].str.extract(r"(\w+\s+PY)")
df
Out[474]:
                                        col_a  ...       col_c_PY
0  Python PY is a general-purpose language LG  ...      Python PY
1        Programming language LG in Python PY  ...      Python PY
2              Its easier LG to understand PY  ...  understand PY
3   The syntax of the language LG is clean PY  ...       clean PY

[4 rows x 3 columns]

You can use

df['col_b_PY'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+PY)\b")
df['col_c_LG'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+LG)\b")

Or, to extract all matches and join them with a space:

df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x: ' '.join(x.dropna()), axis=1)
df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x: ' '.join(x.dropna()), axis=1)

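As an alternative sketch to the extractall / unstack / apply chain, and not part of the original answer, Series.str.findall followed by Series.str.join gives the same space-joined result in one step. The sample rows and the names pattern_py and pattern_lg are assumptions made for illustration. One difference to be aware of: a row with no match ends up as an empty string here rather than NaN.

```python
import pandas as pd

# Hypothetical sample data; the third row has no PY/LG tag on purpose.
df = pd.DataFrame({
    "col_a": [
        "Python PY is a general purpose PY language LG",
        "Its easier LG to understand PY",
        "no tagged words here",
    ]
})

# Same patterns as in the answer: one capture group, then a word boundary.
pattern_py = r"([a-zA-Z'-]+\s+PY)\b"
pattern_lg = r"([a-zA-Z'-]+\s+LG)\b"

# findall returns a list of captured strings per row;
# str.join then concatenates each list with a single space.
df["col_b_PY"] = df["col_a"].str.findall(pattern_py).str.join(" ")
df["col_c_LG"] = df["col_a"].str.findall(pattern_lg).str.join(" ")

print(df)
```

If NaN is preferred for rows without a match, the empty strings can be replaced afterwards.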
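Note: you need to use a capturing group in the regex pattern so that the text is actually extracted. From the pandas documentation: "Extract capture groups in the regex pat as columns in a DataFrame."

Note that the \b word boundary is necessary to match PY / LG as a whole word.
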
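To illustrate the word-boundary point, here is a small sketch using only the standard re module. The sample sentence and the variable names are made up for this example. Without \b, the pattern also matches when PY is just the beginning of a longer word.

```python
import re

# Same pattern as above, with and without the trailing word boundary.
with_boundary = re.compile(r"([a-zA-Z'-]+\s+PY)\b")
without_boundary = re.compile(r"([a-zA-Z'-]+\s+PY)")

# Hypothetical text where "PY" is only a prefix of "PYTHON".
text = "read the PYTHON docs"

loose = without_boundary.search(text)   # matches "the PY" inside "the PYTHON"
strict = with_boundary.search(text)     # no match: "PY" is followed by "T"

print(loose.group(1) if loose else None)
print(strict.group(1) if strict else None)

# On a row from the question, the strict pattern still finds the tagged word.
tagged = with_boundary.search("Python PY is a general-purpose language LG")
print(tagged.group(1))
```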
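Also, if you only want the match to start with a letter, you can modify the patterns to

r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b"
r"([a-zA-Z][a-zA-Z'-]*\s+LG)\b"
   ^^^^^^^^          ^

where [a-zA-Z] will match a single letter and [a-zA-Z'-]* will match 0 or more letters, apostrophes or hyphens.
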
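The difference between the two patterns only shows up when the token before PY consists of punctuation. The following sketch, using the standard re module and invented sample strings, compares them.

```python
import re

# Original pattern: the preceding token may consist only of ' or - characters.
any_start = re.compile(r"([a-zA-Z'-]+\s+PY)\b")
# Modified pattern: the preceding token must begin with a letter.
letter_start = re.compile(r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b")

# Hypothetical input where only dashes precede the tag.
dashes = "-- PY"
m_any = any_start.search(dashes)        # matches "-- PY"
m_letter = letter_start.search(dashes)  # no match

# A normal word with an apostrophe is matched by both patterns.
word = "it's PY"
m_word = letter_start.search(word)

print(m_any.group(1) if m_any else None)
print(m_letter.group(1) if m_letter else None)
print(m_word.group(1) if m_word else None)
```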
Python 3.7 and Pandas 0.24.2:

import pandas as pd

pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 500)
df = pd.DataFrame({
    'col_a': ['Python PY is a general-purpose language LG',
              'Programming language LG in Python PY',
              'Its easier LG to understand PY',
              'The syntax of the language LG is clean PY',
              'Python PY is a general purpose PY language LG']
})
df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x: ' '.join(x.dropna()), axis=1)
df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x: ' '.join(x.dropna()), axis=1)

Output:

                                           col_a              col_b_PY     col_c_LG
0     Python PY is a general-purpose language LG             Python PY  language LG
1           Programming language LG in Python PY             Python PY  language LG
2                 Its easier LG to understand PY         understand PY    easier LG
3      The syntax of the language LG is clean PY              clean PY  language LG
4  Python PY is a general purpose PY language LG  Python PY purpose PY  language LG

Comments:

Maybe
df['col_b_PY'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+PY)\b")
and
df['col_c_LG'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+LG)\b")

Many thanks, @Wiktor Stribizew; a lot of time went into finding this answer.

I added an explanation. Note that extract requires a capturing group to actually extract a string; it only extracts the captured substrings.

col_a
Python PY is a general purpose PY language LG
Here col_a contains PY twice, and I need to capture both Python PY and purpose PY. Our regex captures it only once.
Output: Python PY purpose PY

OK, that is easy to fix with extractall; please see my updated answer.

Many thanks @Wen Ben for the new solution.

Change the single quotes to double quotes?