Python: How to extract data from an HTML table using Beautiful Soup

How can I extract specific data from the table below, for example the decay time 91.1 ms 5?

<table bgcolor=navy cellpadding=4 cellspacing=1 border=0 align=center> 
  <tr class=hp >
    <td nowrap>E(level) (MeV)</td>
    <td nowrap>J&pi;</td><td nowrap>&Delta;(MeV)</td>
    <td nowrap>T<sub>1/2</sub></td>
    <td nowrap>Decay Modes</td>
  </tr>
  <tr class=cp>
    <td nowrap valign=top>0.0</td>
    <td nowrap valign=top>4+</td>
    <td nowrap valign=top> 18.2010</td>
    <td nowrap valign=top>91.1 ms <i>5</i>&nbsp;</td>
    <td nowrap valign=top> &epsilon; : 100.00 &#37;<br>  &epsilon;p : 55.00 &#37;<br>  &epsilon;2p : 1.10 &#37;<br>  &epsilon;&alpha; : 0.04 &#37;<br> </td>
  </tr>
</table>


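You can get the table by its tag name and then iterate over each inner tag to collect the data you need. Note that Beautiful Soup itself has no get_element_by_tag_name method; the equivalent lookups are find() and find_all(), for example soup.find('table').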
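As a rough illustration of the tag-name approach, the sketch below finds the table with find('table'), walks its tr and td tags, and pairs the header cells with the first data row. The helper clean() and the variable names are made up for this example, and it assumes the first row is the header row, as in the table from the question.

```python
from bs4 import BeautifulSoup

html = """<table bgcolor=navy cellpadding=4 cellspacing=1 border=0 align=center>
  <tr class=hp >
    <td nowrap>E(level) (MeV)</td>
    <td nowrap>J&pi;</td><td nowrap>&Delta;(MeV)</td>
    <td nowrap>T<sub>1/2</sub></td>
    <td nowrap>Decay Modes</td>
  </tr>
  <tr class=cp>
    <td nowrap valign=top>0.0</td>
    <td nowrap valign=top>4+</td>
    <td nowrap valign=top> 18.2010</td>
    <td nowrap valign=top>91.1 ms <i>5</i>&nbsp;</td>
    <td nowrap valign=top> &epsilon; : 100.00 &#37;<br>  &epsilon;p : 55.00 &#37;<br> </td>
  </tr>
</table>"""


def clean(cell):
    # Hypothetical helper: collapse whitespace runs, including the
    # non-breaking space that &nbsp; turns into.
    return " ".join(cell.get_text().split())


soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")  # look the table up by tag name

# One list of cleaned cell texts per <tr>
rows = [[clean(td) for td in tr.find_all("td")] for tr in table.find_all("tr")]

# Assumption: the first row holds the column headers
header, first_row = rows[0], rows[1]
record = dict(zip(header, first_row))

half_life = record["T1/2"]
print(half_life)  # 91.1 ms 5
```

Looking the value up by its header text rather than by position keeps the code working if a column is added or moved.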

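Assuming the markup is already in a string, find the elements by class (.cp) and then by tag (td). The .text attribute gives you the value of each element found, so you can use the following code: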

import re
from bs4 import BeautifulSoup

html_doc = """<table bgcolor=navy cellpadding=4 cellspacing=1 border=0 align=center> 
  <tr class=hp >
    <td nowrap>E(level) (MeV)</td>
    <td nowrap>J&pi;</td><td nowrap>&Delta;(MeV)</td>
    <td nowrap>T<sub>1/2</sub></td>
    <td nowrap>Decay Modes</td>
  </tr>
  <tr class=cp>
    <td nowrap valign=top>0.0</td>
    <td nowrap valign=top>4+</td>
    <td nowrap valign=top> 18.2010</td>
    <td nowrap valign=top>91.1 ms <i>5</i>&nbsp;</td>
    <td nowrap valign=top> &epsilon; : 100.00 &#37;<br>  &epsilon;p : 55.00 &#37;<br>  &epsilon;2p : 1.10 &#37;<br>  &epsilon;&alpha; : 0.04 &#37;<br> </td>
  </tr>
</table>"""

soup = BeautifulSoup(html_doc, 'html.parser')
# re.compile("cp") matches any class attribute that contains "cp"
elements = soup.find_all(class_=re.compile("cp"))

for e in elements[0].find_all('td'):
    # e.text holds the text of each <td> in the first matching row
    print(e.text)
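The same "class first, then tag" lookup can also be written as a CSS selector. The following is only a sketch: it relies on select() and select_one(), which need the soupsieve package (normally installed together with beautifulsoup4), and it assumes the half-life is always the fourth td of the row.

```python
from bs4 import BeautifulSoup

html_doc = """<table bgcolor=navy cellpadding=4 cellspacing=1 border=0 align=center>
  <tr class=hp >
    <td nowrap>E(level) (MeV)</td>
    <td nowrap>J&pi;</td><td nowrap>&Delta;(MeV)</td>
    <td nowrap>T<sub>1/2</sub></td>
    <td nowrap>Decay Modes</td>
  </tr>
  <tr class=cp>
    <td nowrap valign=top>0.0</td>
    <td nowrap valign=top>4+</td>
    <td nowrap valign=top> 18.2010</td>
    <td nowrap valign=top>91.1 ms <i>5</i>&nbsp;</td>
    <td nowrap valign=top> &epsilon; : 100.00 &#37;<br>  &epsilon;p : 55.00 &#37;<br> </td>
  </tr>
</table>"""

soup = BeautifulSoup(html_doc, "html.parser")

# All <td> cells inside rows with class "cp"
cells = [td.get_text() for td in soup.select("tr.cp td")]

# Assumption: the half-life sits in the fourth <td> of the row
half_life_cell = soup.select_one("tr.cp > td:nth-of-type(4)")

# .get_text() keeps the trailing non-breaking space from &nbsp;, so strip it
half_life = half_life_cell.get_text().strip()
print(half_life)  # 91.1 ms 5
```

Unlike re.compile("cp"), the selector tr.cp only matches rows whose class list contains exactly cp.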

Here is some simple code that loads the table into a DataFrame:

from bs4 import BeautifulSoup
import pandas as pd

page = """<table cellpadding=4 cellspacing=1 border=0 align=center> 
  <tr class=hp >
    <td nowrap>E(level) (MeV)</td>
    <td nowrap>J&pi;</td>
    <td nowrap>&Delta;(MeV)</td>
    <td nowrap>T<sub>1/2</sub></td>
    <td nowrap>Decay Modes</td>
  </tr>
  <tr class=cp>
    <td nowrap valign=top>0.0</td>
    <td nowrap valign=top>4+</td>
    <td nowrap valign=top> 18.2010</td>
    <td nowrap valign=top>91.1 ms <i>5</i>&nbsp;</td>
    <td nowrap valign=top> &epsilon; : 100.00 &#37;<br>  &epsilon;p : 55.00 &#37;<br>  &epsilon;2p : 1.10 &#37;<br>  &epsilon;&alpha; : 0.04 &#37;<br> </td>
  </tr>
</table>"""

soup = BeautifulSoup(page, "html.parser")
headers = soup.find('tr', class_='hp').find_all('td')
columns = [header.text for header in headers]

data = []
data_raw = soup.find_all('tr', class_='cp')
for row in data_raw:
    items = [element.text for element in row.find_all('td')]
    data.append(items)

df = pd.DataFrame(data, columns=columns)

print(df['T1/2'])
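The Decay Modes cell packs several entries into one td, separated by br tags, so .text runs them together. One possible way to split them, sketched here with a trimmed copy of the table, is to iterate over the cell's stripped_strings; the dictionary layout and the assumption that every entry has the form "mode : share" are choices made for this example only.

```python
from bs4 import BeautifulSoup

page = """<table cellpadding=4 cellspacing=1 border=0 align=center>
  <tr class=cp>
    <td nowrap valign=top>91.1 ms <i>5</i>&nbsp;</td>
    <td nowrap valign=top> &epsilon; : 100.00 &#37;<br>  &epsilon;p : 55.00 &#37;<br>  &epsilon;2p : 1.10 &#37;<br>  &epsilon;&alpha; : 0.04 &#37;<br> </td>
  </tr>
</table>"""

soup = BeautifulSoup(page, "html.parser")

# Assumption: the decay modes are in the last <td> of the "cp" row
decay_cell = soup.find("tr", class_="cp").find_all("td")[-1]

decay_modes = {}
# stripped_strings yields each text fragment between the <br> tags,
# stripped of surrounding whitespace and skipping whitespace-only fragments
for line in decay_cell.stripped_strings:
    mode, _, share = line.partition(":")
    decay_modes[mode.strip()] = share.strip()

print(decay_modes)
```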

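If the Decay Modes cell contains several lines, you may need extra code to split them (they are separated by <br> tags). Alternatively, if you control the HTML, fix it so that each row lives in its own row tag and the headers sit in header tags.

Whenever I see <table> tags, pandas' .read_html() is usually the first thing I try. It returns a list of DataFrames. From there, just pick the DataFrame you want and manipulate it to get, or extract, the data you need: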

from io import StringIO

import pandas as pd


html = '''<table bgcolor=navy cellpadding=4 cellspacing=1 border=0 align=center> 
  <tr class=hp >
    <td nowrap>E(level) (MeV)</td>
    <td nowrap>J&pi;</td><td nowrap>&Delta;(MeV)</td>
    <td nowrap>T<sub>1/2</sub></td>
    <td nowrap>Decay Modes</td>
  </tr>
  <tr class=cp>
    <td nowrap valign=top>0.0</td>
    <td nowrap valign=top>4+</td>
    <td nowrap valign=top> 18.2010</td>
    <td nowrap valign=top>91.1 ms <i>5</i>&nbsp;</td>
    <td nowrap valign=top> &epsilon; : 100.00 &#37;<br>  &epsilon;p : 55.00 &#37;<br>  &epsilon;2p : 1.10 &#37;<br>  &epsilon;&alpha; : 0.04 &#37;<br> </td>
  </tr>
</table>'''

# Wrap the markup in StringIO: recent pandas versions deprecate passing a literal HTML string
tables = pd.read_html(StringIO(html))
df = tables[0]
# The header cells are plain <td> tags, so promote the first row to column names
df.columns = df.iloc[0,:]
df = df.iloc[1:,:]

print(df.loc[1,'T1/2'])
# Output: 91.1 ms 5