Python: getting the contents of a table based on HTML tags
I have the following table:
<table id="sample">
<tbody>
<tr class="toprow">
<td style="width:25%"></td>
<td style="width:25%">Number of Jurisdictions</td>
<td style="width:25%">Per cent of total</td>
</tr>
<tr>
<td class="leftcol">Europe</td>
<td class="data">44</td>
<td class="data">29%</td>
</tr>
</tbody>
</table>
I am able to get the headers:
['', 'Number of Jurisdictions', 'Per cent of total']
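Now I want to get the contents of the cells, but I don't know how to loop over the <td> tags, because their class can be either "leftcol" or "data".

If I understand correctly, I would simplify it a bit: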
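import pandas as pd
from bs4 import BeautifulSoup

# Assumption: html holds the page source that contains the table above
soup = BeautifulSoup(html, "html.parser")

gdp = soup.select("table#sample")[0]
rows = []
cols = []
# The header cells are in the row with class "toprow"
for g in gdp.select('tr.toprow'):
    for c in g.select('td'):
        cols.append(c.text)
# Every other row is a data row; 'td' matches each cell whatever its class
for g in gdp.select('tr:not(.toprow)'):
    row = []
    for item in g.select('td'):
        row.append(item.text)
    rows.append(row)
pd.DataFrame(rows, columns=cols)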
Alternatively, you can simplify it further with list comprehensions (at the cost of some readability, in my opinion):
cols = [c.text for g in gdp.select('tr.toprow') for c in g.select('td')]
rows = [[item.text for item in g.select('td')] for g in gdp.select('tr:not(.toprow)')]
pd.DataFrame(rows, columns=cols)
Output:
                    Number of Jurisdictions Per cent of total
0            Europe                      44               29%
1            Africa                      23               15%
2       Middle East                      13                9%
3  Asia and Oceania                      33               22%
4          Americas                      37               25%
5            Totals                     150              100%
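For anyone who wants to try the approach without the full page, here is a minimal self-contained sketch that runs against the sample markup from the question. The `HTML` constant and the `parse_table` helper are hypothetical names introduced only for this illustration, and the `html.parser` backend is an assumption (the question does not say which parser was used). The snippet in the question shows only the Europe row, so that is the only data row the sketch returns.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (only the Europe row is shown there).
HTML = """
<table id="sample">
<tbody>
<tr class="toprow">
<td style="width:25%"></td>
<td style="width:25%">Number of Jurisdictions</td>
<td style="width:25%">Per cent of total</td>
</tr>
<tr>
<td class="leftcol">Europe</td>
<td class="data">44</td>
<td class="data">29%</td>
</tr>
</tbody>
</table>
"""


def parse_table(html):
    """Return (cols, rows) for the table with id "sample" (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("table#sample")
    # Header cells live in the row marked with class "toprow".
    cols = [td.get_text(strip=True) for td in table.select("tr.toprow td")]
    # The selector "td" matches every cell, whatever its class attribute is,
    # so "leftcol" and "data" cells are collected by the same loop.
    rows = [
        [td.get_text(strip=True) for td in tr.select("td")]
        for tr in table.select("tr:not(.toprow)")
    ]
    return cols, rows


cols, rows = parse_table(HTML)
print(cols)  # ['', 'Number of Jurisdictions', 'Per cent of total']
print(rows)  # [['Europe', '44', '29%']]
```

The resulting `cols` and `rows` can be passed to `pd.DataFrame(rows, columns=cols)` exactly as in the answer above. All values stay as strings; converting "44" or "29%" to numbers would be a separate step.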