Python: extracting and replacing substrings in a DataFrame using regular expressions
I have this:
Title
Num
0 <span class="o-label--tiny">VALEUR ÉNERGÉTIQUE</span>
1 <span class="o-label--tiny">PROTÉINES</span>
2 <span class="o-label--tiny">GLUCIDES</span>
<class 'pandas.core.frame.DataFrame'> Num Index(['Title'], dtype='object')
This is the regular expression I developed:
(<span class=\"o-label--tiny\">)([a-zA-Z]+\s*\w*)(</span>)
Second attempt:
df['Title'] = df['Title'].str.replace('<span class=\"o-label--tiny\">', repl = '')
Title
Num
0 NaN
1 NaN
2 NaN
Third attempt:
df['Title'] = df[lambda df: df.columns[0]].str.extract('(>[a-zA-Z]+\s*\w*)', expand=False)
Result 3:
Title
Num
0 NaN
1 NaN
2 NaN
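Checked against plain Python strings, the pattern from the question does match all three sample rows, which suggests the NaN results come from what the column holds rather than from the regex itself. A minimal sketch with the standard `re` module; the `label` helper and the `ÉNERGIE` string are hypothetical and not taken from the original post:

```python
import re

# The pattern from the question as a raw string (the \" escapes are not needed here).
PATTERN = re.compile(r'(<span class="o-label--tiny">)([a-zA-Z]+\s*\w*)(</span>)')

samples = [
    '<span class="o-label--tiny">VALEUR ÉNERGÉTIQUE</span>',
    '<span class="o-label--tiny">PROTÉINES</span>',
    '<span class="o-label--tiny">GLUCIDES</span>',
]


def label(text):
    """Return the second capture group, or None when the pattern does not match."""
    match = PATTERN.search(text)
    return match.group(2) if match else None


labels = [label(s) for s in samples]
print(labels)

# Hypothetical label that starts with an accented letter: [a-zA-Z]+ needs at
# least one ASCII letter right after the opening tag, so nothing matches.
hypothetical = '<span class="o-label--tiny">ÉNERGIE</span>'
print(label(hypothetical))
```

The pattern is still fragile: it allows at most two words and requires the label to start with an ASCII letter, so a looser capture group is the safer choice.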
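I really don't know what I'm doing wrong, and I would appreciate any help getting the result I want. Thanks, everyone!
Use a regular expression with str.extract.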
I didn't want to involve the DataFrame, but I hope this is useful:
import re
stringa = """
0 <span class="o-label--tiny">VALEUR ÉNERGÉTIQUE</span>
1 <span class="o-label--tiny">PROTÉINES</span>
2 <span class="o-label--tiny">GLUCIDES</span>
"""
pattern1 = "[0-9]"
pattern = ">(.*)<"
found = re.findall(pattern1, stringa)
found2 = re.findall(pattern, stringa)
for f in range(len(found)):
    print(found[f] + " " + found2[f])
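I was about to add this as a comment. You edited it. I'll delete my answer.
@jezrael I tried your code, but neither version works for me. I still get NaN instead of the correct strings. Is there something unusual about my strings that I haven't considered?
@ChiChi - I don't know. Could you send your real data to my email as a pickle file, created with df[['Title']].to_pickle('data.pkl')?
@jezrael - Yes, I'll do that. Thank you very much for the more detailed information.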
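The thread does not say why the NaN values persisted. One possibility, offered only as an assumption, is that the cells are not Python strings (for example, tag objects left over from an HTML parser): pandas `.str` methods return NaN for elements that are not strings. The sketch below uses a hypothetical `FakeTag` class to stand in for such objects and casts the column with `astype(str)` before extracting:

```python
import pandas as pd


class FakeTag:
    """Hypothetical stand-in for a non-string object, e.g. a parsed HTML tag."""

    def __init__(self, html):
        self.html = html

    def __str__(self):
        return self.html


raw = [
    '<span class="o-label--tiny">VALEUR ÉNERGÉTIQUE</span>',
    '<span class="o-label--tiny">PROTÉINES</span>',
    '<span class="o-label--tiny">GLUCIDES</span>',
]
df = pd.DataFrame({"Title": [FakeTag(s) for s in raw]})
df.index.name = "Num"

# Elements that are not str come back as NaN from .str methods.
direct = df["Title"].str.extract(r">(.*)<", expand=False)
print(direct)

# Casting to str first gives the regex real strings to work on.
fixed = df["Title"].astype(str).str.extract(r">(.*)<", expand=False)
print(fixed)
```

If this is the cause, `df['Title'].map(type).value_counts()` on the real data would show a type other than `str`.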
# Option 1: anchor on the full opening and closing tags
df['Title'] = df['Title'].str.extract('<span class=\"o-label--tiny\">(.*)</span>', expand=False)
print(df)
Title
Num
0 VALEUR ÉNERGÉTIQUE
1 PROTÉINES
2 GLUCIDES
# Option 2: capture everything between the first > and the last <
df['Title'] = df['Title'].str.extract('>(.*)<', expand=False)
print(df)
Title
Num
0 VALEUR ÉNERGÉTIQUE
1 PROTÉINES
2 GLUCIDES
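Note that `>(.*)<` is greedy: it captures from the first `>` to the last `<` in the cell. That is fine for the single-span cells shown here, but it would over-capture if a cell held several tags. A hedged sketch with the standard `re` module (the two-span string and the `first_group` helper are hypothetical); the same pattern strings can be passed to `str.extract`:

```python
import re

single = '<span class="o-label--tiny">GLUCIDES</span>'
# Hypothetical cell with two spans, not part of the question's data.
double = '<span>GLUCIDES</span><span>LIPIDES</span>'

greedy = r">(.*)<"      # runs up to the last "<" in the string
bounded = r">([^<]*)<"  # stops at the first "<" after the ">"


def first_group(pattern, text):
    """Return the first capture group of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


print(first_group(greedy, single))
print(first_group(bounded, single))
print(first_group(greedy, double))
print(first_group(bounded, double))
```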
Output of the re.findall script (the answer that does not use the DataFrame):
0 VALEUR ÉNERGÉTIQUE
1 PROTÉINES
2 GLUCIDES
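The two separate `findall` calls rely on both lists having the same length and order, so any stray digit in the text would shift the pairing. A possible variant, sketched here as an assumption rather than as part of the original answer, captures the row number and the label with a single pattern (`row_pattern` is a name introduced for illustration):

```python
import re

stringa = """
0 <span class="o-label--tiny">VALEUR ÉNERGÉTIQUE</span>
1 <span class="o-label--tiny">PROTÉINES</span>
2 <span class="o-label--tiny">GLUCIDES</span>
"""

# One match per line: the leading row number, then the text inside the span.
row_pattern = re.compile(r"^(\d+)\s+<span[^>]*>([^<]*)</span>", re.MULTILINE)

rows = row_pattern.findall(stringa)
for num, label in rows:
    print(num + " " + label)
```

Because each tuple comes from one match, a number and its label cannot drift apart.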