Python table extraction: BeautifulSoup vs. pandas.read_html
I have an HTML file taken from here, but I cannot extract any kind of table with bs4.BeautifulSoup() or pandas.read_html. I know that each row of the desired table starts with <tr class='odd'>. Still, something does not work when I pass soup.find({'class': 'odd'}) or pd.read_html(url, attrs={'class': 'odd'}). Where is the error, or what should I do instead?
The table apparently starts at requests.get(url).content[8359:]:
<table style="background-color:#FFFEEE; border-width:thin; border-collapse:collapse; border-spacing:0; border-style:outset;" rules="groups" >
<colgroup>
<colgroup>
<colgroup>
<colgroup>
<colgroup span="3">
<colgroup span="3">
<colgroup span="3">
<colgroup span="3">
<colgroup>
<tbody>
<tr style="vertical-align:middle; background-color:#177A9C">
<th scope="col" style="text-align:center">Ion</th>
<th scope="col" style="text-align:center"> Observed <br /> Wavelength <br /> Vac (nm) </th>
<th scope="col" style="text-align:center; white-space:nowrap"> <i>g<sub>k</sub>A<sub>ki</sub></i><br /> (10<sup>8</sup> s<sup>-1</sup>) </th>
<th scope="col"> Acc. </th>
<th scope="col" style="text-align:center; white-space:nowrap"> <i>E<sub>i</sub></i> <br /> (eV) </th>
<th> </th>
<th scope="col" style="text-align:center; white-space:nowrap"> <i>E<sub>k</sub></i> <br /> (eV) </th>
<th scope="col" style="text-align:center" colspan="3"> Lower Level <br /> Conf., Term, J </th>
<th scope="col" style="text-align:center" colspan="3"> Upper Level <br /> Conf., Term, J </th>
<th scope="col" style="text-align:center"> <i>g<sub>i</sub></i> </th>
<th scope="col" style="text-align:center"> <b>-</b> </th>
<th scope="col" style="text-align:center"> <i>g<sub>k</sub></i> </th>
<th scope="col" style="text-align:center"> Type </th>
</tr>
</tbody>
<tbody>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr class='odd'>
<td class="lft1"><b>C I</b> </td>
<td class="fix"> 193.090540 </td>
<td class="lft1">1.02e+01 </td>
<td class="lft1"> A</td>
<td class="fix">1.2637284 </td>
<td class="dsh">- </td>
<td class="fix">7.68476771 </td>
<td class="lft1"> 2<i>s</i><sup>2</sup>2<i>p</i><sup>2</sup> </td>
<td class="lft1"> <sup>1</sup>D </td>
<td class="lft1"> 2 </td>
<td class="lft1"> 2<i>s</i><sup>2</sup>2<i>p</i>3<i>s</i> </td>
<td class="lft1"> <sup>1</sup>P° </td>
<td class="lft1"> 1 </td>
<td class="rgt"> 5</td>
<td class="dsh">-</td>
<td class="lft1">3 </td>
<td class="cnt"><sup></sup><sub></sub></td>
</tr>
You need to search for the tag first, and then for the class. So, using the lxml parser:
soup = BeautifulSoup(yourdata, 'lxml')
for i in soup.find_all('tr', attrs={'class': 'odd'}):
    print(i.text)
From this point you can write the data directly to a file, or build an array (a list of lists, one per row) and load it into pandas, etc. This code can give you a quick start on the project; however, if you are looking for someone to build the whole thing (requesting the data, scraping, storing), I would suggest hiring somebody or learning how to do it yourself. Here is the BeautifulSoup documentation.
Go through it (the quickstart guide) and you will learn almost everything there is to know about bs4.
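As for why pd.read_html(url, attrs={'class': 'odd'}) found nothing: the attrs argument is matched against the <table> element itself, not against its rows. The NIST table carries rules="groups" on the <table> tag, so you can match on that instead. A minimal sketch with a stand-in HTML snippet (the two-column table below is a simplified placeholder, not the real NIST markup):

```python
import io
import pandas as pd

# Simplified stand-in for the NIST page: note rules="groups"
# sits on <table>, while class="odd" sits on the <tr> rows.
html = """
<table rules="groups">
  <tr><th>Ion</th><th>Wavelength</th></tr>
  <tr class="odd"><td>C I</td><td>193.090540</td></tr>
</table>
"""

# attrs filters <table> elements, so match the table's own attribute.
tables = pd.read_html(io.StringIO(html), attrs={'rules': 'groups'})
df = tables[0]
print(df)
```

Matching attrs={'class': 'odd'} here would raise "No tables found", because no <table> element has that class.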
import requests
from bs4 import BeautifulSoup
from time import sleep
url = 'https://physics.nist.gov/'
second_part = 'cgi-bin/ASD/lines1.pl?spectra=C%20I%2C%20Ti%20I&limits_type=0&low_w=190&upp_w=250&unit=1&de=0&format=0&line_out=0&no_spaces=on&remove_js=on&en_unit=1&output=0&bibrefs=0&page_size=15&show_obs_wl=1&unc_out=0&order_out=0&max_low_enrg=&show_av=2&max_upp_enrg=&tsb_value=0&min_str=&A_out=1&A8=1&max_str=&allowed_out=1&forbid_out=1&min_accur=&min_intens=&conf_out=on&term_out=on&enrg_out=on&J_out=on&g_out=on&submit=Retrieve%20Data%27'
page = requests.get(url+second_part)
soup = BeautifulSoup(page.content, "lxml")
whole_table = soup.find('table', rules='groups')
sub_tbody = whole_table.find_all('tbody')
# the two above lines are used to locate the table and the content
# we then continue to iterate through sub-categories i.e. tbody-s > tr-s > td-s
for tag in sub_tbody:
    if tag.find('tr').find('td'):
        table_rows = tag.find_all('tr')
        for tag2 in table_rows:
            if tag2.has_attr('class'):
                td_tags = tag2.find_all('td')
                print(td_tags[0].text, '<- Is the ion')
                print(td_tags[1].text, '<- Wavelength')
                print(td_tags[2].text, '<- Some formula gk Aki')
                # and so on...
                print('--'*40)  # unnecessary, but does print ----------...
    else:
        pass
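Instead of printing, the same traversal can collect each row into a list of lists and hand it to pandas, as suggested above. A minimal sketch against a stand-in snippet (the HTML and the two column names are hypothetical placeholders for the full 17-column NIST table):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Simplified stand-in for the NIST markup: data rows carry class="odd".
html = """
<table rules="groups">
  <tbody>
    <tr class="odd"><td>C I</td><td>193.090540</td></tr>
    <tr><td></td><td></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Collect one list of cell texts per class="odd" row.
rows = []
for tr in soup.find_all('tr', attrs={'class': 'odd'}):
    rows.append([td.get_text(strip=True) for td in tr.find_all('td')])

df = pd.DataFrame(rows, columns=['Ion', 'Wavelength'])
print(df)
```

On the real page you would pass the full header list (Ion, Observed Wavelength, gkAki, and so on) as the columns argument.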