Python: extract two numeric columns from a text column
I have a DataFrame:
df = pd.DataFrame({"id": [1,2,3,4,5],
"text": ["This is a ratio of 13.4/10","Favorate rate of this id is 11/9","It may not be a good looking person. But he is vary popular (15/10)","Ratio is 12/10","very popular 17/10"],
"name":["Joe","Adam","Sara","Jose","Bob"]})
I want to split these numbers into two columns, giving the following result:
df = pd.DataFrame({"id": [1,2,3,4,5],
"text": ["This is a ratio of 13.4/10","Favorate rate of this id is 11/9","It may not be a good looking person. But he is vary popular (15/10)","Ratio is 12/10","very popular 17/10"],
"name":["Joe","Adam","Sara","Jose","Bob"],
"rating_nominator":[13.4,11,15,12,17],
"rating_denominator":[10,9,10,10,10]})
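Thank you very much for your help.
The general pattern you want to match is (some number)/(some other number). Matching floating-point numbers is not a trivial task, and there are plenty of answers on Stack Overflow that cover it, so you can make use of one here.
A fairly robust expression, adapted from one such answer, is ([+-]?(?:[0-9]*[.])?[0-9]+). You can use it with str.extract and an f-string: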
fpr = r'([+-]?(?:[0-9]*[.])?[0-9]+)'
res = df.text.str.extract(fr'{fpr}\/{fpr}').astype(float)
To assign the result to your DataFrame:
df[['rating_nominator', 'rating_denominator']] = res
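For reference, the steps above can be put together into one self-contained sketch. It reuses the question's sample data and the answer's `fpr` pattern; the only change is that the slash is written as a plain `/`, which should behave the same as the escaped `\/` in a Python regular expression.

```python
import pandas as pd

# Sample data copied from the question (typos in the text are part of the data).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "text": [
        "This is a ratio of 13.4/10",
        "Favorate rate of this id is 11/9",
        "It may not be a good looking person. But he is vary popular (15/10)",
        "Ratio is 12/10",
        "very popular 17/10",
    ],
    "name": ["Joe", "Adam", "Sara", "Jose", "Bob"],
})

# Float pattern from the answer: optional sign, optional "digits + dot", then digits.
fpr = r'([+-]?(?:[0-9]*[.])?[0-9]+)'

# Two capture groups separated by a slash produce a two-column DataFrame.
res = df["text"].str.extract(fr'{fpr}/{fpr}').astype(float)

# Assign the two extracted columns back to the original DataFrame.
df[["rating_nominator", "rating_denominator"]] = res
print(df[["id", "rating_nominator", "rating_denominator"]])
```

Note that `str.extract` only returns the first match in each string, so a text containing several ratios would need `str.extractall` instead.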
You can use:
df[['rating_nominator', 'rating_denominator']] = df['text'].str.extract(r'(-?\d+(?:\.\d+)?)/(-?\d+(?:\.\d+)?)').astype(float)
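The regex (-?\d+(?:\.\d+)?)/(-?\d+(?:\.\d+)?) captures an integer or a float as the numerator and as the denominator.
(Edit: the regex in the other answer covers more cases. I made some assumptions, for example that a unary + sign will not appear in the numbers.)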
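To see where the two patterns actually differ, here is a small comparison sketch. The `samples` series and the names `broad` and `narrow` are made up for illustration; the two patterns themselves are taken from the answers.

```python
import pandas as pd

# Hypothetical edge cases: a leading-dot decimal, a unary plus, and negatives.
samples = pd.Series([".5/2", "+3/4", "-1.5/-2"])

broad = r'([+-]?(?:[0-9]*[.])?[0-9]+)'   # pattern from the first answer
narrow = r'(-?\d+(?:\.\d+)?)'            # pattern from this answer

broad_res = samples.str.extract(fr'{broad}/{broad}').astype(float)
narrow_res = samples.str.extract(fr'{narrow}/{narrow}').astype(float)

# Side-by-side view of the extracted numerators and denominators.
comparison = pd.concat(
    [samples, broad_res, narrow_res],
    axis=1,
    keys=["text", "broad", "narrow"],
)
print(comparison)
```

For `"+3/4"` both patterns end up with the same numbers, because the narrow pattern simply starts matching at the `3`. The leading-dot case is the one that differs: the narrow pattern skips the dot and reads `.5/2` as 5 over 2, while the broad pattern reads it as 0.5 over 2.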
Demo:
   id                                               text  name  rating_nominator  rating_denominator
0   1                         This is a ratio of 13.4/10   Joe              13.4                10.0
1   2                   Favorate rate of this id is 11/9  Adam              11.0                 9.0
2   3  It may not be a good looking person. But he is...  Sara              15.0                10.0
3   4                                     Ratio is 12/10  Jose              12.0                10.0
4   5                                 very popular 17/10   Bob              17.0                10.0
>>> df
   id                  text
0   1  foo 14.12/10.123 bar
1   2                 10/12
2   3             13.4/14.5
3   4          -12.24/-13.5
4   5                1/-1.2
>>>
>>> df[['rating_nominator', 'rating_denominator']] = df['text'].str.extract(r'(-?\d+(?:\.\d+)?)/(-?\d+(?:\.\d+)?)').astype(float)
>>> df
   id                  text  rating_nominator  rating_denominator
0   1  foo 14.12/10.123 bar             14.12              10.123
1   2                 10/12             10.00              12.000
2   3             13.4/14.5             13.40              14.500
3   4          -12.24/-13.5            -12.24             -13.500
4   5                1/-1.2              1.00              -1.200
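One case the demos do not cover is a row with no ratio at all. The sketch below is a variation rather than either answer's exact code: it uses named capture groups so that `str.extract` labels the output columns itself, and it assumes a hypothetical three-row frame in which one text has no match. Rows without a match come back as NaN, which `astype(float)` accepts.

```python
import pandas as pd

# Hypothetical data: the second row contains no "number/number" pair.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["Ratio is 12/10", "no rating here", "very popular 17/10"],
})

# Same number pattern as the second answer, without the capture parentheses.
num = r'-?\d+(?:\.\d+)?'

# Named groups become the column names of the extracted DataFrame.
pattern = fr'(?P<rating_nominator>{num})/(?P<rating_denominator>{num})'
ratings = df["text"].str.extract(pattern).astype(float)

# Join on the index to attach the new columns to the original rows.
df = df.join(ratings)

# Rows where nothing was extracted can be inspected or dropped afterwards.
unmatched = df[df["rating_nominator"].isna()]
print(df)
print(unmatched)
```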