Regex: Using a regular expression to split a column in Pandas
My first question... I have a pandas DataFrame with a column "Description". The column contains a reference and a name, and I want to split it into two columns. I have the "Names" in a separate df:
#  Description                              #  Names
------------------------------------------  ---------------
0  A long walk by Miss D'Bus                0  Teresa Green
1  A day in the country by Teresa Green     1  Tim Burr
2  Falling Trees by Tim Burr                2  Miss D'Bus
3  Evergreens by Teresa Green
4  Late for Dinner by Miss D'Bus
I have managed to search the description for a matching name by using a regex string built from all the names:
regex = '$|'.join(map(re.escape, df['Names'])) + '$'
df['Reference'] = df['Description'].str.split(regex, expand=True)
which gives
#  Description                              Reference
-------------------------------------------------------------------
0  A long walk by Miss D'Bus                A long walk by
1  A day in the country by Teresa Green     A day in the country by
2  Falling Trees by Tim Burr                Falling Trees by
3  Evergreens by Teresa Green               Evergreens by
4  Late for Dinner by Miss D'Bus            Late for Dinner by
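But I want the corresponding name (= the removed delimiter) as an additional column.
I tried adding *? to the regex.
I tried splitting the "Description" column using the "Reference" column: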
df['Name'] = df['Description'].str.split(df['Reference'])
I tried using the length of the "Reference" string to slice the "Description" column, like
# like: df['Name'] = df['Description'].str[-10:]
df['Name'] = df['Description'].str[-(df['Reference'].str.len()):]
But I only get this to work with a constant slice length.

You can use Series.str.extract to get both pieces of information from the original column:
regex = r'^(.*?)\s*({})$'.format('|'.join(map(re.escape, df['Names'])))
df[['Reference','Name']] = df['Description'].str.extract(regex, expand=True)
Output:
>>> df
   Description                              Reference                Name
0  A long walk by Miss D'Bus                A long walk by           Miss D'Bus
1  A day in the country by Teresa Green     A day in the country by  Teresa Green
2  Falling Trees by Tim Burr                Falling Trees by         Tim Burr
3  Evergreens by Teresa Green               Evergreens by            Teresa Green
4  Late for Dinner by Miss D'Bus            Late for Dinner by       Miss D'Bus
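For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same approach. The sample rows are copied from the question. The variable names names_df, alternation and pattern are only illustrative, and it assumes the names live in a separate frame as the question describes.

```python
import re
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({"Description": [
    "A long walk by Miss D'Bus",
    "A day in the country by Teresa Green",
    "Falling Trees by Tim Burr",
    "Evergreens by Teresa Green",
    "Late for Dinner by Miss D'Bus",
]})
# Hypothetical name for the separate frame that holds the known names.
names_df = pd.DataFrame({"Names": ["Teresa Green", "Tim Burr", "Miss D'Bus"]})

# Escape each name so that regex metacharacters in a name are matched literally.
alternation = "|".join(map(re.escape, names_df["Names"]))
pattern = r"^(.*?)\s*({})$".format(alternation)

# Two capture groups give two columns; rows without a known name become NaN.
df[["Reference", "Name"]] = df["Description"].str.extract(pattern, expand=True)
print(df)
```

Rows whose description does not end in one of the known names get NaN in both new columns, so it may be worth checking df["Name"].isna() afterwards.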
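The regex looks like ^(.*?)\s*(Teresa\ Green|Tim\ Burr|Miss\ D'Bus)$:
- ^ - start of the string
- (.*?) - Group 1 ("Reference"): any zero or more characters other than line breaks, as few as possible
- \s* - zero or more whitespace characters
- (Teresa\ Green|Tim\ Burr|Miss\ D'Bus) - Group 2 ("Name"): an alternation group with the known names
- $ - end of the string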
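To see what each part of the pattern contributes, the following sketch applies it to a single string with the standard re module. It also compares the lazy group 1 with a greedy variant, which is not part of the answer and is included only for contrast. The variable names are illustrative.

```python
import re

names = ["Teresa Green", "Tim Burr", "Miss D'Bus"]
alternation = "|".join(map(re.escape, names))

# Lazy group 1, as in the answer, versus a greedy variant for comparison.
lazy = re.compile(r"^(.*?)\s*({})$".format(alternation))
greedy = re.compile(r"^(.*)\s*({})$".format(alternation))

text = "Falling Trees by Tim Burr"

lazy_groups = lazy.match(text).groups()
greedy_groups = greedy.match(text).groups()
print(lazy_groups)    # ('Falling Trees by', 'Tim Burr')
print(greedy_groups)  # ('Falling Trees by ', 'Tim Burr')

# A string that does not end in a known name does not match at all.
no_match = lazy.match("Untitled by Somebody Else")
print(no_match)       # None
```

With the greedy group the trailing space stays in the reference, because .* takes as much as it can and \s* is left matching nothing. The lazy .*? lets \s* absorb the whitespace before the name.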