Python: split a string column in two at the last space


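I have a pandas DataFrame df with a column "title" (the first column in the table below).

I need to split this column into two parts: the actual title and the irregular code at the end. Is there a way to split it on the last word after a space? Note that the last title has no code; 123 is part of the title.

Target DataFrame: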

title                             | cleaned title             | code
This is the first title XY2547    | This is the first title   | XY2547
This is the second title WWW48921 | This is the second title  | WWW48921
This is the third title  A2438999 | This is the third title   | A2438999
This is another title 123         | This is another title 123 |
I was thinking of something like this:

df['code'] = df.title.str.extract(r'_\s(\w)', expand=False)
This does not work.

Thanks

Try this:

In [62]: df
Out[62]:
                               title
0     This is the first title XY2547
1  This is the second title WWW48921
2  This is the third title  A2438999
3         This is another title 123

In [63]: df[['cleaned_title', 'code']] = \
    ...:     df.title.str.extract(r'(.*?)\s*([A-Z]{1,}\d{3,})?$', expand=True)

In [64]: df
Out[64]:
                               title              cleaned_title      code
0     This is the first title XY2547    This is the first title    XY2547
1  This is the second title WWW48921   This is the second title  WWW48921
2  This is the third title  A2438999    This is the third title  A2438999
3         This is another title 123   This is another title 123       NaN
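For running this outside IPython, here is a self-contained sketch of the same idea. The sample frame is rebuilt by hand from the question, and the pattern uses `\s*` before the optional code group so that a title with no trailing code (and no trailing whitespace) still matches. The rule "one or more uppercase letters followed by three or more digits" is taken from the answer above; adjust it if your codes look different.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    'title': [
        'This is the first title XY2547',
        'This is the second title WWW48921',
        'This is the third title  A2438999',
        'This is another title 123',
    ]
})

# Group 1: lazy title; group 2: optional code (1+ uppercase letters, 3+ digits)
pattern = r'(.*?)\s*([A-Z]+\d{3,})?$'

# The two capture groups are assigned positionally to the two new columns
df[['cleaned_title', 'code']] = df['title'].str.extract(pattern, expand=True)
print(df)
```

Rows where the optional group does not match get a missing value in `code`, and the whole title stays in `cleaned_title`.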
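Solution #1
str.rsplit can be used here. It splits starting from the right side of the string, and the number of splits can be limited with n.

We can then use join to attach the result to df.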

df.join(
    df.title.str.rsplit(n=1, expand=True).rename(
        columns={0: 'cleaned title', 1: 'code'}
    )
)

                               title             cleaned title      code
0     This is the first title XY2547   This is the first title    XY2547
1  This is the second title WWW48921  This is the second title  WWW48921
2  This is the third title  A2438999   This is the third title  A2438999
3         This is another title 123      This is another title       123
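Note that the last row's 123 ends up in the code column. If that is not wanted, one option is to keep the split only where the last token looks like a code. The following is a minimal sketch, assuming a code is one or more uppercase letters followed by three or more digits; that rule and the names `parts` and `is_code` are illustrative and not part of the question.

```python
import pandas as pd

df = pd.DataFrame({
    'title': [
        'This is the first title XY2547',
        'This is the second title WWW48921',
        'This is the third title  A2438999',
        'This is another title 123',
    ]
})

# Split once from the right: column 0 is the title, column 1 the last token
parts = df['title'].str.rsplit(n=1, expand=True)

# Assumed rule for what counts as a code; missing tokens count as "not a code"
is_code = parts[1].str.fullmatch(r'[A-Z]+\d{3,}', na=False)

# Keep the split where the token is a code, otherwise keep the full title
df['cleaned title'] = parts[0].where(is_code, df['title'])
df['code'] = parts[1].where(is_code)
print(df)
```

`Series.where` keeps values where the condition is true and substitutes the fallback (or a missing value) elsewhere, so no row-wise loop is needed.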
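Solution #2
To avoid interpreting 123 as a code, some additional logic that was not provided in the question has to be applied. @Max was kind enough to embed his logic in the regular expression.

My regex solution is shown below.
Plan

  • Use '?P<name>' to name the resulting columns
  • Match only uppercase letters and digits with '[A-Z0-9]'
  • Make sure there are 4 or more of them with '{4,}'
  • Anchor the match from the start with '^' to the end with '$'
  • Make sure '.*' is not greedy by using '.*?'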
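The plan can be tried on single strings with Python's re module before applying it to the whole column. This is a sketch: the pattern is assembled from the bullets above, and `split_title` is a hypothetical helper name used only for illustration.

```python
import re

PATTERN = re.compile(
    r'^(?P<cleaned_title>.*?)'    # lazy: take as little of the title as possible
    r'\s*'                        # optional whitespace before the code
    r'(?P<code>[A-Z0-9]{4,})?$'   # optional code: 4+ uppercase letters or digits
)

def split_title(title):
    """Return (cleaned_title, code); code is None when no code is found."""
    m = PATTERN.match(title)
    if m is None:
        # Only expected for multi-line input, since '.' does not match newlines
        return title, None
    return m.group('cleaned_title'), m.group('code')

print(split_title('This is the first title XY2547'))
# ('This is the first title', 'XY2547')
print(split_title('This is another title 123'))
# ('This is another title 123', None)
```

Because '123' has only three characters, it fails the '{4,}' requirement, so the optional code group stays empty and the lazy title group expands to cover the whole string.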


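Comments:

Would you mind explaining the r'(.*)\s+(\w+\d+) part in more detail? These are regular expressions, right? I am googling them…

@jeangelj, if you like, you can use an online regex explainer.

@jeangelj, glad I could help :)

Using that incredible regex explainer, do I understand this correctly: I am extracting a string based on a regular expression (i.e. r), namely everything after a space at the end (i.e. (.?)\s and $) that contains one letter and at least 3 digits?

@jeangelj it is 1 or more letters followed by 3 or more digits. Use that knowledge to adjust it to exactly what you need.

Thanks. In the example above it does put "123" into the separate column as a value, even though it is part of the title, because the code is missing; I suppose that has to be cleaned up manually. Is there a way to specify that the last word should only be split off if it contains digits?

Thank you very much. I just upvoted this answer and am testing it now; thanks a lot for the regex explanation as well.

@jeangelj no problem, glad I could help.

Thank you; I always upvote. @piRSquared, the best and fastest answers on Stack Overflow!
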
# Solution #2: the named groups become the column names
regex = r'^(?P<cleaned_title>.*?)\s*(?P<code>[A-Z0-9]{4,})?$'
df.join(df.title.str.extract(regex, expand=True))

                               title              cleaned_title      code
0     This is the first title XY2547    This is the first title    XY2547
1  This is the second title WWW48921   This is the second title  WWW48921
2  This is the third title  A2438999    This is the third title  A2438999
3          This is another title 123  This is another title 123       NaN