Python: getting the contents of a table based on HTML tags

I have the following table:

<table id="sample">
    <tbody>
        <tr class="toprow">
            <td style="width:25%"></td>
            <td style="width:25%">Number of Jurisdictions</td>
            <td style="width:25%">Per cent of total</td>
        </tr>
        <tr>
            <td class="leftcol">Europe</td>
            <td class="data">44</td>
            <td class="data">29%</td>
        </tr>
    </tbody>
</table>

I am able to get the headers:

['', 'Number of Jurisdictions', 'Per cent of total']

Now I want to get the contents of the cells, but I don't know how to loop over the <td> tags, because their class can be either "leftcol" or "data".

If I understand correctly, I would simplify it like this:

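import pandas as pd  # soup is assumed to be your existing BeautifulSoup object

# Grab the table by its id.
gdp = soup.select("table#sample")[0]
rows = []
cols = []

# Header cells come from the row with class "toprow".
for g in gdp.select('tr.toprow'):
    for c in g.select('td'):
        cols.append(c.text)

# Every other row is data; select('td') matches cells whatever their class.
for g in gdp.select('tr:not(.toprow)'):
    row = []
    for item in g.select('td'):
        row.append(item.text)
    rows.append(row)

pd.DataFrame(rows, columns=cols)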

Alternatively, you can simplify it further with list comprehensions (at some cost to readability, in my opinion):

cols = [c.text for g in gdp.select('tr.toprow') for c in g.select('td')]
rows = [[item.text for item in g.select('td')] for g in gdp.select('tr:not(.toprow)')]
pd.DataFrame(rows, columns=cols)

Output:

                        Number of Jurisdictions     Per cent of total
0   Europe              44                          29%
1   Africa              23                          15%
2   Middle East         13                           9%
3   Asia and Oceania    33                          22%
4   Americas            37                          25%
5   Totals             150                          100%
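
The snippets above assume that soup already exists. For anyone who wants to try the idea in isolation, here is a minimal self-contained sketch that parses the table markup from the question. It assumes the built-in "html.parser" backend and a Beautiful Soup version with CSS selector support for :not(). The names html and table are introduced here for illustration only, and the sample contains just the Europe row shown in the question.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (only the Europe row is shown there).
html = """
<table id="sample">
    <tbody>
        <tr class="toprow">
            <td style="width:25%"></td>
            <td style="width:25%">Number of Jurisdictions</td>
            <td style="width:25%">Per cent of total</td>
        </tr>
        <tr>
            <td class="leftcol">Europe</td>
            <td class="data">44</td>
            <td class="data">29%</td>
        </tr>
    </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table#sample")

# Header cells: every <td> inside the row with class "toprow".
cols = [td.get_text(strip=True) for td in table.select("tr.toprow td")]

# Data cells: every <td> in the remaining rows. The selector names only the
# tag, so "leftcol" and "data" cells are picked up alike.
rows = [
    [td.get_text(strip=True) for td in tr.select("td")]
    for tr in table.select("tr:not(.toprow)")
]

print(cols)
print(rows)
```

The point is that the selector only names the td tag, so the class on each cell does not matter. The resulting cols and rows can be passed to pd.DataFrame(rows, columns=cols) exactly as in the answer.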
