Regex: Pandas, find an exact word and the word before it in a string column, and append the matches as new columns

Tags: regex, python-3.x, pandas

Find the target word and the word before it in col_a, and append the matched strings as col_b_PY and col_c_LG.

I tried the code below to achieve this, but I am not getting the expected output. Any help is appreciated.
Here is my attempt with regular expressions:

df['col_b_PY'] = df.col_a.str.contains(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,1}PY")

df.col_a.str.extract(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,1}PY", expand=True)

The DataFrame looks like this:

col_a
Python PY is a general-purpose language LG
Programming language LG in Python PY
Its easier LG to understand  PY
The syntax of the language LG is clean PY

Required output:

col_a                                       col_b_PY       col_c_LG
Python PY is a general-purpose language LG  Python PY      language LG
Programming language LG in Python PY        Python PY      language LG
Its easier LG to understand  PY             understand PY  easier LG
The syntax of the language LG is clean PY   clean PY       language LG

Check with:

df['col_c_LG'],df['col_c_PY']=df['col_a'].str.extract(r"(\w+\s+LG)"),df['col_a'].str.extract(r"(\w+\s+PY)")
df
Out[474]: 
                                        col_a       ...              col_c_PY
0  Python PY is a general-purpose language LG       ...             Python PY
1       Programming language LG in Python PY        ...             Python PY
2             Its easier LG to understand  PY       ...        understand  PY
3   The syntax of the language LG is clean PY       ...              clean PY
[4 rows x 3 columns]

You can use

df['col_b_PY'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+PY)\b")
df['col_c_LG'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+LG)\b")

Or, to extract all matches and join them with a space:

df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
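
Note: you need a capturing group in the regex pattern so that extract can actually return text. From the pandas documentation:

Extract capture groups in the regex pat as columns in a DataFrame.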
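
To make the capturing-group requirement concrete, here is a minimal sketch (the one-row Series and the variable names are made up for illustration). It runs the same pattern with and without parentheses; without a group, str.extract raises a ValueError rather than returning the whole match.

```python
import pandas as pd

s = pd.Series(["Python PY is a general-purpose language LG"])

# No capturing group: str.extract has nothing to return and raises.
try:
    s.str.extract(r"[a-zA-Z'-]+\s+PY\b")
    no_group_error = None
except ValueError as exc:
    no_group_error = str(exc)

# One capturing group: the captured substring is the result.
# expand=False returns a Series instead of a one-column DataFrame.
with_group = s.str.extract(r"([a-zA-Z'-]+\s+PY)\b", expand=False)

print(no_group_error)
print(with_group.tolist())
```

This is also why the second attempt in the question, which only uses a non-capturing group (?:...), cannot work with str.extract.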

Note that the \b word boundary is necessary to match PY / LG as a whole word.
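
A small sketch of what the trailing \b changes, using two made-up strings. Without the boundary, PY also matches the first two letters of a longer word such as PYTHON.

```python
import pandas as pd

s = pd.Series(["uses PYTHON daily", "uses PY daily"])

# Without \b, "PY" may be just the start of a longer word.
without_boundary = s.str.extract(r"([a-zA-Z'-]+\s+PY)", expand=False)

# With \b, "PY" must end at a word boundary, so "PYTHON" is rejected.
with_boundary = s.str.extract(r"([a-zA-Z'-]+\s+PY)\b", expand=False)

print(without_boundary.tolist())
print(with_boundary.tolist())
```

The left side of PY needs no extra boundary here because the pattern already requires whitespace (\s+) right before it.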

Also, if you only want a match to start with a letter, you can change the patterns to

r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b"
r"([a-zA-Z][a-zA-Z'-]*\s+LG)\b"
   ^^^^^^^^          ^
where [a-zA-Z] matches one letter and [a-zA-Z'-]* matches zero or more letters, apostrophes or hyphens.
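
The difference between the two patterns only shows up when the preceding word starts with an apostrophe or hyphen. A minimal sketch with an invented input string:

```python
import pandas as pd

s = pd.Series(["a --dash PY"])

# Original pattern: the word before PY may start with "'" or "-".
loose = s.str.extract(r"([a-zA-Z'-]+\s+PY)\b", expand=False)

# Revised pattern: the word before PY must start with a letter.
strict = s.str.extract(r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b", expand=False)

print(loose[0])
print(strict[0])
```

With the original pattern the leading hyphens are kept in the result; the revised pattern starts the match at the first letter instead.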

Demo with Python 3.7 and Pandas 0.24.2:

import pandas as pd

pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 500)

df = pd.DataFrame({
    'col_a': ['Python PY is a general-purpose language LG',
             'Programming language LG in Python PY',
             'Its easier LG to understand  PY',
             'The syntax of the language LG is clean PY',
             'Python PY is a general purpose PY language LG']
    })
df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)

Output:

                                           col_a              col_b_PY     col_c_LG
0     Python PY is a general-purpose language LG             Python PY  language LG
1           Programming language LG in Python PY             Python PY  language LG
2                Its easier LG to understand  PY        understand  PY    easier LG
3      The syntax of the language LG is clean PY              clean PY  language LG
4  Python PY is a general purpose PY language LG  Python PY purpose PY  language LG
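
As an alternative to extractall followed by unstack and apply, the same joined result can be sketched with str.findall and str.join. This is a different technique from the one in the answer above, and the loop, the shortened sample data and the extra "no tags here" row are assumptions added for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col_a": [
    "Python PY is a general-purpose language LG",
    "Its easier LG to understand  PY",
    "Python PY is a general purpose PY language LG",
    "no tags here",  # hypothetical row with no match at all
]})

for tag, col in [("PY", "col_b_PY"), ("LG", "col_c_LG")]:
    pattern = rf"([a-zA-Z'-]+\s+{tag})\b"
    # findall returns a list of captured strings per row;
    # str.join concatenates each list with a single space.
    df[col] = df["col_a"].str.findall(pattern).str.join(" ")

print(df)
```

One behavioural difference to be aware of: a row without any match ends up as an empty string here, whereas the extractall version leaves it as NaN after assignment.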

Maybe
df['col_b_PY']=df['col_a'].str.extract(r'([a-zA-Z'-]+\s+PY)\b')
df['col_c_LG']=df['col_a'].str.extract(r'([a-zA-Z'-]+\s+LG)\b')
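Thank you very much @Wiktor Stribizew for taking the time to find the answer.
I added an explanation. Note that extract requires a capturing group to actually extract a string; it only returns the captured substrings.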
Col_a:
Python PY is a general purpose PY language LG
Col_a contains PY twice, and I need to capture both Python PY and purpose PY. Our regex captures only the first one.
Expected output:
Python PY purpose PY
OK, that is easy to fix with extractall, see my updated answer.
Thank you very much @Wen Ben, you came up with a new solution, the whole answer
Change the single quotes to double quotes?