Regex: Splitting a column in Pandas using a regular expression

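My first question... I have a pandas DataFrame with a column "Description". Each value in that column holds a reference followed by a name, and I want to split it into two columns. I have the "Names" in a separate df: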

#  Description                                   #  Names
---------------------------------------          ---------------
0  A long walk by Miss D'Bus                     0  Teresa Green
1  A day in the country by Teresa Green          1  Tim Burr
2  Falling Trees by Tim Burr                     2  Miss D'Bus
3  Evergreens by Teresa Green
4  Late for Dinner by Miss D'Bus
I have managed to search the descriptions for a matching name by building a regex string from all the names:

regex = '$|'.join(map(re.escape, df['Names'])) + '$' 
df['Reference'] = df['Description'].str.split(regex, expand=True)
which gives

#  Description                                   Reference
-----------------------------------------------------------------------
0  A long walk by Miss D'Bus                     A long walk by
1  A day in the country by Teresa Green          A day in the country by
2  Falling Trees by Tim Burr                     Falling Trees by
3  Evergreens by Teresa Green                    Evergreens by
4  Late for Dinner by Miss D'Bus                 Late for Dinner by
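But I would like the corresponding name (i.e. the delimiter that was removed) as an additional column.

I tried adding *? to the regex.

I tried splitting the "Description" column using the "Reference" column: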

df['Name'] = df['Description'].str.split(df['Reference'])
I also tried slicing the "Description" column by the length of the "Reference" string, like

# like: df['Name'] = df['Description'].str[-10:]
df['Name'] = df['Description'].str[-(df['Reference'].str.len()):]

But I get a constant slice length.

You can use Series.str.extract to get both pieces of information from the original column:

import re

regex = r'^(.*?)\s*({})$'.format('|'.join(map(re.escape, df['Names'])))
df[['Reference','Name']] = df['Description'].str.extract(regex, expand=True)
Output:

>>> df
                            Description                Reference          Name
0             A long walk by Miss D'Bus           A long walk by    Miss D'Bus
1  A day in the country by Teresa Green  A day in the country by  Teresa Green
2             Falling Trees by Tim Burr         Falling Trees by      Tim Burr
3            Evergreens by Teresa Green            Evergreens by  Teresa Green
4         Late for Dinner by Miss D'Bus       Late for Dinner by    Miss D'Bus
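For reference, here is a self-contained sketch of the same approach that can be run as-is. It assumes the names live in a separate frame, as described in the question; the variable name `names_df` is hypothetical and not taken from the original post.

```python
import re
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Description": [
        "A long walk by Miss D'Bus",
        "A day in the country by Teresa Green",
        "Falling Trees by Tim Burr",
        "Evergreens by Teresa Green",
        "Late for Dinner by Miss D'Bus",
    ]
})
# Assumption: the known names are held in a separate DataFrame
names_df = pd.DataFrame({"Names": ["Teresa Green", "Tim Burr", "Miss D'Bus"]})

# Escape each name so it is matched literally, then join into one alternation
alternatives = "|".join(map(re.escape, names_df["Names"]))
pattern = r"^(.*?)\s*({})$".format(alternatives)

# Two capture groups give two result columns: the reference and the name
df[["Reference", "Name"]] = df["Description"].str.extract(pattern, expand=True)
print(df)
```

Because the pattern has two capture groups, `str.extract` returns one column per group, which is why both new columns can be assigned in a single step.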
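The regex will look like
^(.*?)\s*(Teresa\ Green|Tim\ Burr|Miss\ D\'Bus)$

  • ^ - start of string
  • (.*?) - Group 1 ("Reference"): any zero or more characters other than line break characters, as few as possible
  • \s* - zero or more whitespace characters
  • (Teresa\ Green|Tim\ Burr|Miss\ D\'Bus) - Group 2 ("Name"): an alternation group with the known names
  • $ - end of string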
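The breakdown above can be checked on a single string with the standard `re` module. This is only an illustrative sketch: the list `names` and the non-matching sample string are made up for the demonstration.

```python
import re

# Assumption: the same three names as in the question
names = ["Teresa Green", "Tim Burr", "Miss D'Bus"]
pattern = re.compile(r"^(.*?)\s*({})$".format("|".join(map(re.escape, names))))

# Group 1 is the lazy prefix, group 2 is whichever known name ends the string
m = pattern.match("Evergreens by Teresa Green")
reference, name = m.groups()
print(repr(reference), repr(name))

# A string that does not end in one of the known names does not match at all
no_match = pattern.match("Evergreens by Someone Else")
print(no_match)
```

In pandas, rows where the pattern does not match come back from `Series.str.extract` as NaN in both columns, so descriptions with an unknown name are easy to spot afterwards.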