Python: How to split a DataFrame string column based on multiple conditions
I have a pandas DataFrame as shown below:
ID Col.A
28654 This is a dark chocolate which is sweet
39876 Sky is blue 1234 Sky is cloudy 3423
88776 Stars can be seen in the dark sky
35491 Schools are closed 4568 but shops are open
I am trying to split Col.A before the word dark or before a number. The result I expect is shown below:
ID     Col.A                     Col.B
28654  This is a                 dark chocolate which is sweet
39876  Sky is blue               1234 Sky is cloudy 3423
88776  Stars can be seen in the  dark sky
35491  Schools are closed        4568 but shops are open
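I tried grouping the rows that contain the word dark into one DataFrame and the rows with numbers into another, then splitting each one accordingly. After that, I can concatenate the resulting DataFrames to get the expected result. The code is shown below: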
import pandas as pd

df = pd.DataFrame({'ID': [28654, 39876, 88776, 35491],
                   'Col.A': ['This is a dark chocolate which is sweet',
                             'Sky is blue 1234 Sky is cloudy 3423',
                             'Stars can be seen in the dark sky',
                             'Schools are closed 4568 but shops are open']})
# Rows that contain the word "dark"
df1 = df[df['Col.A'].str.contains(' dark ')]
# Remaining rows (anti-join via the merge indicator)
df2 = df.merge(df1, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
# Splitting on the delimiter itself drops "dark" and the digits
df1 = df1['Col.A'].str.split(' dark ', expand=True)
df2 = df2['Col.A'].str.split(r'\d+', expand=True)
pd.concat([df1, df2], axis=0)
The result I get is different from the expected one. It is:
   0                         1
0  This is a                 chocolate which is sweet
2  Stars can be seen in the  sky
1  Sky is blue               Sky is cloudy
3  Schools are closed        but shops are open
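I am losing the numbers from the strings and the word dark in the result.
So how can I fix this and get the result without losing the words and numbers I split on?
Is there a way to "slice before the desired word or number" without removing them?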
df[["Col.A", "Col.B"]] = df["Col.A"].str.split(
r"\s*(dark.*|\d.*)", n=1, expand=True
)[[0, 1]]
print(df)
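Prints:
   ID     Col.A                     Col.B
0  28654  This is a                 dark chocolate which is sweet
1  39876  Sky is blue               1234 Sky is cloudy 3423
2  88776  Stars can be seen in the  dark sky
3  35491  Schools are closed        4568 but shops are open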
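The split in the answer above keeps the keyword because the pattern wraps it in a capturing group, and a regex split returns captured text as part of the result. Here is a minimal sketch of that behavior using the standard library's re.split on two of the sample strings; the variable names are only for illustration, and maxsplit=1 is meant to mirror n=1 in Series.str.split.

```python
import re

# Pattern from the answer above: optional whitespace, then a captured
# group that starts at "dark" or at a digit and runs to the end of the string.
pattern = r"\s*(dark.*|\d.*)"

samples = [
    "This is a dark chocolate which is sweet",
    "Sky is blue 1234 Sky is cloudy 3423",
]

# Because the group is captured, re.split keeps its text in the result.
# The trailing empty string is what remains after the group, which is why
# the answer selects only columns 0 and 1.
parts = [re.split(pattern, s, maxsplit=1) for s in samples]
for p in parts:
    print(p)
```

Note that this pattern has no word boundaries, so it would also split before a word such as "darkness" or before a digit inside a token.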
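We can use Series.str.split with a regex.
Regex details:
\s+ : matches any whitespace character one or more times
(?=\b(?:dark|\d+)\b) : positive lookahead
\b : word boundary to prevent partial matches
(?:dark|\d+) : non-capturing group
dark : first alternative, matches the characters dark literally
\d+ : second alternative, matches any digit one or more times
\b : word boundary to prevent partial matches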
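The pieces listed above appear to combine into the pattern \s+(?=\b(?:dark|\d+)\b). The following is a sketch of how it could be applied with Series.str.split, assuming the sample data from the question; n=1 and expand=True are my assumptions for getting exactly two columns, since the original call is not shown here.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [28654, 39876, 88776, 35491],
    "Col.A": [
        "This is a dark chocolate which is sweet",
        "Sky is blue 1234 Sky is cloudy 3423",
        "Stars can be seen in the dark sky",
        "Schools are closed 4568 but shops are open",
    ],
})

# Split once (n=1) at whitespace that is followed by the whole word "dark"
# or by a run of digits. The lookahead does not consume what it matches,
# so the word or number stays at the start of the second part.
pattern = r"\s+(?=\b(?:dark|\d+)\b)"
df[["Col.A", "Col.B"]] = df["Col.A"].str.split(pattern, n=1, expand=True)
print(df)
```

Only the whitespace is consumed by the split, so neither column should carry a stray leading or trailing space for these samples.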
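With your shown samples, please try the following, which uses pandas' str.extract function. Simple explanation: the regex passed to extract creates a first capture group with a non-greedy match, and a second capture group that starts at a number or the string dark and runs to the end of the line. The two groups are saved into Col.A and Col.B.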
df[["Col.A","Col.B"]] = df['Col.A'].str.extract(r'(.*?)((?:dark|\d+).*)', expand=True)
df
For the shown samples, the output will be as follows:
   ID     Col.A                     Col.B
0  28654  This is a                 dark chocolate which is sweet
1  39876  Sky is blue               1234 Sky is cloudy 3423
2  88776  Stars can be seen in the  dark sky
3  35491  Schools are closed        4568 but shops are open
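One detail of the extract pattern is easy to miss in the printed table: the lazy first group stops right before the keyword, so it keeps the space that precedes it. The sketch below uses re.search on one sample string to show the two groups, which should match what str.extract captures per row. The second pattern, with \s* between the groups, is a hypothetical variant of mine and not part of the original answer.

```python
import re

s = "This is a dark chocolate which is sweet"

# Pattern from the answer: lazy prefix, then the keyword or number plus the rest.
m = re.search(r"(.*?)((?:dark|\d+).*)", s)
before, after = m.groups()
print(repr(before), repr(after))  # the first group ends with a space

# Hypothetical variant: consume the whitespace between the two groups so
# the first group has no trailing space.
m2 = re.search(r"(.*?)\s*((?:dark|\d+).*)", s)
before2, after2 = m2.groups()
print(repr(before2), repr(after2))
```

Calling .str.strip() on the resulting Col.A would be another way to remove the trailing space.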
That's cool. What if I have dark and dark in the same row, and I only need to split before dark? Is there any way to do that?
@Athullt Yes, we can do that. I have edited the answer.