Python: extracting and replacing substrings in a DataFrame with regular expressions


I have this:

                                                Title  
Num                                                      
0    <span class="o-label--tiny">VALEUR ÉNERGÉTIQUE</span>   
1         <span class="o-label--tiny">PROTÉINES</span>   
2          <span class="o-label--tiny">GLUCIDES</span> 

<class 'pandas.core.frame.DataFrame'> Num Index(['Title'], dtype='object')
This is the regular expression I developed:

(<span class=\"o-label--tiny\">)([a-zA-Z]+\s*\w*)(</span>)
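
As a quick check outside pandas, the pattern can be run with the standard `re` module on plain strings. This is only a sketch: the `samples` list is copied from the table above, and the names `pattern`, `samples`, and `labels` are illustrative. On these three strings the pattern matches, which suggests (but does not prove) that the NaN results below come from the data rather than from the regex. Note that `[a-zA-Z]` does not cover accented letters; the match succeeds here only because `\w` is Unicode-aware for `str` patterns in Python 3.

```python
import re

# The pattern from the question, written as a raw string.
pattern = re.compile(r'(<span class="o-label--tiny">)([a-zA-Z]+\s*\w*)(</span>)')

# Sample strings copied from the question's table.
samples = [
    '<span class="o-label--tiny">VALEUR ÉNERGÉTIQUE</span>',
    '<span class="o-label--tiny">PROTÉINES</span>',
    '<span class="o-label--tiny">GLUCIDES</span>',
]

# Group 2 is the label text between the opening and closing tags.
labels = [pattern.search(s).group(2) for s in samples]
print(labels)
```

A label with three or more words would not match this pattern, since it allows at most one whitespace run between two word chunks.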
Second attempt:

df['Title'] = df['Title'].str.replace('<span class=\"o-label--tiny\">', repl = '')
   Title  
Num                                                         
0     NaN  
1     NaN  
2     NaN
Third attempt:

df['Title'] = df[lambda df: df.columns[0]].str.extract('(>[a-zA-Z]+\s*\w*)', expand=False)
Result 3:

   Title  
Num                                                         
0     NaN  
1     NaN  
2     NaN
I really don't know what I'm doing wrong. If you can help me get the result I want, I'd be very grateful. Thanks, everyone!

Use `Series.str.extract` with a capture group around the text between the tags.

Regex: I didn't want to involve the DataFrame, but I hope this is useful:

import re

stringa = """
0    <span class="o-label--tiny">VALEUR ÉNERGÉTIQUE</span>
1         <span class="o-label--tiny">PROTÉINES</span>
2          <span class="o-label--tiny">GLUCIDES</span>
"""

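# Row numbers: any single digit (the sample text contains no other digits)
pattern1 = r"[0-9]"
# Text between the first ">" and the last "<" on each line
pattern = r">(.*)<"

found = re.findall(pattern1, stringa)
found2 = re.findall(pattern, stringa)

# Pair each row number with its extracted title
for num, title in zip(found, found2):
    print(num + " " + title)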

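I was just about to add this as a comment. You edited it. I'll delete my answer.
@jezrael I tried your code, but none of it works for me. I still get NaN instead of the correct strings. Is there something unusual about my strings that I haven't considered?
@ChiChi - I don't know. Could you send me your real data as a pickle file by email, e.g. df[['Title']].to_pickle('data.pkl')?
@jezrael - Yes, I'll do that. Thank you very much for the more detailed information.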
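
The thread does not say why the asker still got NaN. One possibility, offered purely as an assumption, is that the column holds objects that are not Python strings (for example parsed HTML elements that merely print like strings); the `.str` methods return NaN for such values. In the sketch below, `FakeTag` is a hypothetical stand-in for such an object and is not taken from the asker's data.

```python
import pandas as pd


class FakeTag:
    """Hypothetical stand-in for a non-string object that prints as HTML."""

    def __init__(self, html):
        self.html = html

    def __str__(self):
        return self.html


raw = pd.Series(
    [
        FakeTag('<span class="o-label--tiny">PROTÉINES</span>'),
        FakeTag('<span class="o-label--tiny">GLUCIDES</span>'),
    ],
    name="Title",
)

# The .str methods yield NaN for values that are not str.
as_objects = raw.str.extract(r">(.*)<", expand=False)

# Converting to str first lets the same pattern match.
as_strings = raw.astype(str).str.extract(r">(.*)<", expand=False)

print(as_objects.isna().all())
print(as_strings.tolist())
```

If this is the cause, checking `df['Title'].map(type).unique()` on the real data would show a type other than `str`.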
With `str.extract` and the full tag in the pattern:

df['Title'] = df['Title'].str.extract('<span class=\"o-label--tiny\">(.*)</span>', expand=False)
print(df)
                  Title
Num                    
0    VALEUR ÉNERGÉTIQUE
1             PROTÉINES
2              GLUCIDES
Or with a shorter pattern that captures everything between `>` and `<`:

df['Title'] = df['Title'].str.extract('>(.*)<', expand=False)
print(df)
                  Title
Num                    
0    VALEUR ÉNERGÉTIQUE
1             PROTÉINES
2              GLUCIDES
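
For reference, here is a minimal self-contained sketch of both operations named in the title, extraction and replacement. It assumes the `Title` column really contains plain strings, and it rebuilds the DataFrame from the values shown in the question; the variable names `extracted` and `stripped` are illustrative.

```python
import pandas as pd

# Rebuild the question's DataFrame, assuming plain-string cells.
df = pd.DataFrame(
    {
        "Title": [
            '<span class="o-label--tiny">VALEUR ÉNERGÉTIQUE</span>',
            '<span class="o-label--tiny">PROTÉINES</span>',
            '<span class="o-label--tiny">GLUCIDES</span>',
        ]
    },
    index=pd.Index([0, 1, 2], name="Num"),
)

# Extract: keep only the captured text between ">" and "<".
extracted = df["Title"].str.extract(r">(.*)<", expand=False)

# Replace: delete every tag, leaving the text in place.
stripped = df["Title"].str.replace(r"<[^>]+>", "", regex=True)

print(extracted.tolist())
print(stripped.tolist())
```

Passing `regex=True` explicitly matters for `str.replace`, because without it recent pandas versions treat the pattern as a literal string.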
Output of the `re.findall` script:

0 VALEUR ÉNERGÉTIQUE
1 PROTÉINES
2 GLUCIDES