
Extracting the words inside HTML tags in Python/Pandas

html python-3.x pandas selenium-webdriver


I have scraped HTML text that I need to format into a table. I want to extract everything inside the bold tags:

I have the following code:

import pandas as pd
html='<b>HR</b>Shohei Ohtani<br><b>2B</b>Mike Trout(2)/<br><b>SF</b>Billy Bob'

This produces:

print(html_df)
                     content
0    <b>HR</b>Shohei Ohtani<
1  ><b>2B</b>Mike Trout(2)/<
2        ><b>SF</b>Billy Bob
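
The snippet above defines `html` but does not show how `html_df` was built. The printed rows are consistent with splitting the string on the literal text `br`, so the following is a minimal reconstruction under that assumption, not necessarily the asker's actual code. The name `clean_df` is hypothetical and only illustrates splitting on the full `<br>` tag instead.

```python
import pandas as pd

html = '<b>HR</b>Shohei Ohtani<br><b>2B</b>Mike Trout(2)/<br><b>SF</b>Billy Bob'

# Assumption: the printed rows look like the string was split on the literal
# text 'br', which leaves the stray '<' and '>' of each <br> tag behind.
html_df = pd.DataFrame({'content': html.split('br')})

# Splitting on the whole tag instead gives rows without the stray brackets.
clean_df = pd.DataFrame({'content': html.split('<br>')})

print(html_df)
print(clean_df)
```

Either frame works with the approaches in the answers, since both keep each `<b>...</b>` pair intact within its row.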


I want this:

print(html_df)
                     content var
0    <b>HR</b>Shohei Ohtani< HR
1  ><b>2B</b>Mike Trout(2)/< 2B
2        ><b>SF</b>Billy Bob SF


I tried using Beautiful Soup and .findall, with no luck. I'm open to different approaches, including reversing some of my steps.

Is this what you need?

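from bs4 import BeautifulSoup
html='<b>HR</b>Shohei Ohtani<br><b>2B</b>Mike Trout(2)/<br><b>SF</b>Billy Bob'
# Name the parser explicitly so bs4 does not warn and pick one on its own
soup = BeautifulSoup(html, 'html.parser')
b_tags = soup.find_all('b')

for b_tag in b_tags:
    # Prints the text of each bold tag: HR, 2B, SF
    print(b_tag.text)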
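
The loop above only prints the tag text, while the question asks for a table. As a rough sketch of how the same parse could feed a DataFrame, the following collects each bold tag together with the text node that follows it. The column names `var` and `name` and the frame name `tags_df` are assumptions for illustration, and the code assumes every `<b>` tag is directly followed by the player text, as in the sample string.

```python
import pandas as pd
from bs4 import BeautifulSoup

html = '<b>HR</b>Shohei Ohtani<br><b>2B</b>Mike Trout(2)/<br><b>SF</b>Billy Bob'
soup = BeautifulSoup(html, 'html.parser')

rows = []
for b_tag in soup.find_all('b'):
    # Assumed layout: the node right after </b> is the player text.
    following = b_tag.next_sibling
    name = following.strip() if isinstance(following, str) else ''
    rows.append({'var': b_tag.get_text(), 'name': name})

tags_df = pd.DataFrame(rows)
print(tags_df)
```

This skips the manual string split entirely, which is one way of "reversing some of the steps" mentioned in the question.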


The solution takes just one line of code:

html_df['var'] = html_df['content'].str.extract(r'<b>.*?(.*)</b>')
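
A pattern with a greedy group, such as `r'<b>.*?(.*)</b>'`, returns the expected value only when each row holds a single `<b>` tag, because the group runs up to the last `</b>` in the row. The sketch below compares it with a lazy group and with `str.findall` on a made-up row that contains two bold tags. The sample frame `df` and the result names are hypothetical and are not part of the original answer.

```python
import pandas as pd

# Hypothetical data: the second row deliberately contains two <b> tags.
df = pd.DataFrame({'content': [
    '<b>HR</b>Shohei Ohtani<',
    '<b>2B</b>Mike Trout<b>SF</b>Billy Bob',
]})

# Greedy group: captures everything up to the LAST </b> in the row.
greedy = df['content'].str.extract(r'<b>.*?(.*)</b>', expand=False)

# Lazy group: stops at the FIRST </b>, so only the first tag's text is kept.
lazy = df['content'].str.extract(r'<b>(.*?)</b>', expand=False)

# findall: returns a list with the text of every <b> tag in the row.
all_tags = df['content'].str.findall(r'<b>(.*?)</b>')

print(greedy.tolist())
print(lazy.tolist())
print(all_tags.tolist())
```

With one tag per row, as in the question's data, all three approaches yield the same tag text; `expand=False` simply makes `str.extract` return a Series rather than a one-column DataFrame.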



Thanks, that works.
