Python: how to split a DataFrame string column based on multiple conditions


I have a pandas DataFrame that looks like this:

    ID       Col.A

28654      This is a dark chocolate which is sweet 
39876      Sky is blue 1234 Sky is cloudy 3423
88776      Stars can be seen in the dark sky
35491      Schools are closed 4568 but shops are open
I am trying to split Col.A just before the word dark or before a number. The result I expect is shown below:

     ID             Col.A                             Col.B
    
    28654      This is a                  dark chocolate which is sweet 
    39876      Sky is blue                1234 Sky is cloudy 3423
    88776      Stars can be seen in the   dark sky
    35491      Schools are closed         4568 but shops are open
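I tried grouping the rows that contain the word dark into one DataFrame and the rows with numbers into another, then splitting each accordingly. After that I can concatenate the resulting DataFrames to get the expected result. The code is as follows: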

import pandas as pd

df = pd.DataFrame({'ID': [28654, 39876, 88776, 35491],
                   'Col.A': ['This is a dark chocolate which is sweet',
                             'Sky is blue 1234 Sky is cloudy 3423',
                             'Stars can be seen in the dark sky',
                             'Schools are closed 4568 but shops are open']})

# Rows that contain the word "dark"
df1 = df[df['Col.A'].str.contains(' dark ')]
# Remaining rows (anti-join against df1)
df2 = df.merge(df1, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
# Split each group on its own delimiter; str.split discards the delimiter itself
df1 = df1['Col.A'].str.split(' dark ', expand=True)
df2 = df2['Col.A'].str.split(r'\d+', expand=True)
pd.concat([df1, df2], axis=0)
The result differs from what I expected. It is:

      0                              1
0   This is a                   chocolate which is sweet
2   Stars can be seen in the     sky    
1   Sky is blue                  Sky is cloudy  
3   Schools are closed           but shops are open
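I lose the numbers in the strings and the word dark in the result.

So how can I fix this and get the result without losing the word and the numbers that I split on?

Is there a way to "slice before the desired word or number" without removing them?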

df[["Col.A", "Col.B"]] = df["Col.A"].str.split(
    r"\s*(dark.*|\d.*)", n=1, expand=True
)[[0, 1]]
print(df)
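Prints:

    ID      Col.A                       Col.B
0   28654   This is a                   dark chocolate which is sweet
1   39876   Sky is blue                 1234 Sky is cloudy 3423
2   88776   Stars can be seen in the    dark sky
3   35491   Schools are closed          4568 but shops are open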
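
The split above keeps the keyword because the pattern wraps it in a capture group. Python's `re.split` returns captured text as part of the result list, and pandas appears to rely on the same behaviour for regex patterns. A minimal sketch on a single string, using plain `re` instead of pandas (the variable names are illustrative only):

```python
import re

text = "Sky is blue 1234 Sky is cloudy 3423"

# Without a capture group the matched delimiter is discarded.
plain = re.split(r"\s*\d+\s*", text, maxsplit=1)

# With a capture group the matched text is kept as its own list item.
# The group also swallows the rest of the line (.*), so the final item
# is an empty string, which is why the answer keeps only [[0, 1]].
kept = re.split(r"\s*(dark.*|\d.*)", text, maxsplit=1)

print(plain)
print(kept)
```

The empty third item is the reason for selecting columns 0 and 1 after the split.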
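Another option is Series.str.split with a lookahead, which splits on the whitespace before the keyword without consuming the keyword itself.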

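Regex details:

  • \s+
    : matches any whitespace character one or more times
  • (?=\b(?:dark|\d+)\b)
    : positive lookahead
    • \b
      : word boundary that prevents partial matches
    • (?:dark|\d+)
      : non-capturing group
      • dark
        : first alternative, matches the characters dark literally
      • \d+
        : second alternative, matches any digit one or more times
    • \b
      : word boundary that prevents partial matches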
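
The code for this answer did not survive on the page. The following is a reconstruction from the regex details above, so treat it as a sketch and not as the answerer's exact code. The DataFrame is the sample from the question; the `pattern` variable name is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [28654, 39876, 88776, 35491],
    "Col.A": [
        "This is a dark chocolate which is sweet",
        "Sky is blue 1234 Sky is cloudy 3423",
        "Stars can be seen in the dark sky",
        "Schools are closed 4568 but shops are open",
    ],
})

# Split once on whitespace that is followed by the whole word "dark"
# or by a run of digits. The lookahead (?=...) only checks for the
# keyword and does not consume it, so it stays at the start of Col.B.
pattern = r"\s+(?=\b(?:dark|\d+)\b)"
df[["Col.A", "Col.B"]] = df["Col.A"].str.split(pattern, n=1, expand=True)
print(df)
```

Because of the `\b` on both sides, a word such as darkness would not trigger a split, and `n=1` stops after the first split point so later numbers stay inside Col.B.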


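With the samples you have shown, please try the following, which uses pandas' str.extract. In short: the regex creates a first capturing group with a non-greedy match, and a second capturing group that starts at a run of digits or the string dark and continues to the end of the line. The two groups are then saved into Col.A and Col.B.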

df[["Col.A","Col.B"]] = df['Col.A'].str.extract(r'(.*?)((?:dark|\d+).*)', expand=True)
df
For the samples shown, the output looks like this:

    ID      Col.A                       Col.B
0   28654   This is a                   dark chocolate which is sweet
1   39876   Sky is blue                 1234 Sky is cloudy 3423
2   88776   Stars can be seen in the    dark sky
3   35491   Schools are closed          4568 but shops are open
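
Two details of the extract pattern are worth checking on real data: the first group keeps the space that precedes the keyword, and rows containing neither dark nor a digit come back as NaN. A small sketch, where the second row is a made-up example and the `\s*` variant is a suggested tweak, not part of the original answer:

```python
import pandas as pd

s = pd.Series([
    "This is a dark chocolate which is sweet",
    "Nothing to split here",  # made-up row with no keyword or digit
])

# Pattern from the answer: group 0 still ends with the space before "dark".
parts = s.str.extract(r"(.*?)((?:dark|\d+).*)", expand=True)
trailing = parts[0][0]

# Assumed tweak: \s* between the groups keeps that whitespace out of both.
trimmed = s.str.extract(r"(.*?)\s*((?:dark|\d+).*)", expand=True)

print(repr(trailing))
print(trimmed)
```

Calling `.str.strip()` on the first column afterwards would be another way to drop the trailing space.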

That's cool. What if the same row contains dark twice and I only need to split before dark? Is there a way to do that?

@Athullt Yes, we can do that. I have edited the answer.
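
As a rough illustration of the follow-up question: with a lazy first group, str.extract splits at the first place where dark can start, while a greedy first group moves the split to the last occurrence. The sample sentence is made up for this sketch, and the greedy variant is an assumption about what the commenter may have wanted, not the answerer's actual edit.

```python
import pandas as pd

s = pd.Series(["The dark room has a dark door"])

# Lazy first group (.*?): stops at the first "dark".
first = s.str.extract(r"(.*?)\s*(\bdark\b.*)", expand=True)

# Greedy first group (.*): backtracks only as far as the last "dark".
last = s.str.extract(r"(.*)\s+(\bdark\b.*)", expand=True)

print(first)
print(last)
```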