python/beautifulsoup: find the preceding row that has a specific attribute


I am scraping an HTML file that contains a table like this:

<table>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Cardinalidae</b></p></td></tr>
<tr class="highlight1"><td>Summer Tanager</td><td><a href="species.jsp?avibaseid=891798D9EFFE1F8D"><i>Piranga rubra</i></a></td><td>Piranga vermillon</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Scarlet Tanager</td><td><a href="species.jsp?avibaseid=4210163221C2E458"><i>Piranga olivacea</i></a></td><td>Piranga écarlate</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Rose-breasted Grosbeak</td><td><a href="species.jsp?avibaseid=7C2FCB13BAA660EE"><i>Pheucticus ludovicianus</i></a></td><td>Cardinal à poitrine rose</td><td>Rare/Accidental </td></tr>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Buntings</b></p></td></tr>
<tr class="highlight1"><td>Indigo Bunting</td><td><a href="species.jsp?avibaseid=043F337AA25E7D97"><i>Passerina cyanea</i></a></td><td>Passerin indigo</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Dickcissel</td><td><a href="species.jsp?avibaseid=592E58CE67D092DA"><i>Spiza americana</i></a></td><td>Dickcissel d'Amérique</td><td>Rare/Accidental </td></tr>
</table>

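What I want to do is get, for each row, the value in the tr valign="bottom" row above it. I know how to search forward and downward with CSS selectors in BeautifulSoup, but I don't know how to go backward and select the tr valign="bottom" that precedes each tr class="highlight1".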

I would like my CSV output to look like this:

PASSERIFORMES: Cardinalidae,Summer Tanager,Piranga rubra...
PASSERIFORMES: Cardinalidae,Scarlet Tanager,Piranga olivacea...
PASSERIFORMES: Cardinalidae,Rose-breasted Grosbeak,Pheucticus ludovicianus...
PASSERIFORMES: Buntings,Indigo Bunting,Passerina cyanea...
PASSERIFORMES: Buntings,Dickcissel,Spiza americana...


I couldn't find any examples of this, and I would really appreciate any help.

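You can simply read your table into pandas and then slice it however you see fit:

from io import StringIO
import pandas as pd

langs = """your html above"""
# read_html returns a list of DataFrames, one per <table>; newer pandas versions
# deprecate passing a literal HTML string, so wrap it in StringIO.
tables = pd.read_html(StringIO(langs))
df = tables[0]
Output (pardon the formatting):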
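One possible way to go from that raw frame to the rows the question asks for is to carry each group header down onto the rows beneath it. The sketch below is only an illustration: the hand-built frame `raw` is a hypothetical stand-in for the result of read_html, and it assumes the colspan header text is repeated across the first three columns while the fourth column is empty on header rows. The names `raw`, `is_header`, `family`, and `tidy` are not from the original answer.

```python
import pandas as pd

# Hypothetical stand-in for the parsed table. Assumption: the colspan=3 header
# text fills columns 0-2 and column 3 is missing on header rows.
raw = pd.DataFrame([
    ["PASSERIFORMES: Cardinalidae"] * 3 + [None],
    ["Summer Tanager", "Piranga rubra", "Piranga vermillon", "Rare/Accidental"],
    ["Scarlet Tanager", "Piranga olivacea", "Piranga écarlate", "Rare/Accidental"],
    ["PASSERIFORMES: Buntings"] * 3 + [None],
    ["Indigo Bunting", "Passerina cyanea", "Passerin indigo", "Rare/Accidental"],
])

# Treat a row as a group header when its last column is missing.
is_header = raw[3].isna()

# Keep the header text only on header rows, then forward-fill it downward.
family = raw[0].where(is_header).ffill()

# Drop the header rows and put the carried-down header in front.
tidy = raw.loc[~is_header].copy()
tidy.insert(0, "family", family[~is_header])
tidy = tidy.reset_index(drop=True)

print(tidy.to_csv(index=False, header=False))
```

If the real frame marks header rows differently, only the `is_header` line should need to change; the where/ffill step stays the same.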




If you want a solution that does not use pandas, you can use the following script:

from bs4 import BeautifulSoup


txt = '''
<table>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Cardinalidae</b></p></td></tr>
<tr class="highlight1"><td>Summer Tanager</td><td><a href="species.jsp?avibaseid=891798D9EFFE1F8D"><i>Piranga rubra</i></a></td><td>Piranga vermillon</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Scarlet Tanager</td><td><a href="species.jsp?avibaseid=4210163221C2E458"><i>Piranga olivacea</i></a></td><td>Piranga écarlate</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Rose-breasted Grosbeak</td><td><a href="species.jsp?avibaseid=7C2FCB13BAA660EE"><i>Pheucticus ludovicianus</i></a></td><td>Cardinal à poitrine rose</td><td>Rare/Accidental </td></tr>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Buntings</b></p></td></tr>
<tr class="highlight1"><td>Indigo Bunting</td><td><a href="species.jsp?avibaseid=043F337AA25E7D97"><i>Passerina cyanea</i></a></td><td>Passerin indigo</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Dickcissel</td><td><a href="species.jsp?avibaseid=592E58CE67D092DA"><i>Spiza americana</i></a></td><td>Dickcissel d'Amérique</td><td>Rare/Accidental </td></tr>
</table>'''

soup = BeautifulSoup(txt, 'html.parser')

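all_data = []
# Select only the data rows: every <tr> that has no <td colspan> cell.
for tr in soup.select('tr:not(:has(td[colspan]))'):
    all_data.append([
        # find_previous walks backward through the document to the nearest header cell.
        tr.find_previous('td', {'colspan': True}).get_text(strip=True),
        *[td.get_text(strip=True) for td in tr.select('td')]
    ])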

# print data to screen:
for row in all_data:
    print(*row, sep=', ')
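
Since the question asks for CSV output, the rows could also be written with the standard csv module instead of print, which takes care of quoting if a field ever contains a comma. This is a minimal sketch: `rows` is a hypothetical sample in the same shape as all_data, and `rows_to_csv` is an illustrative helper, not part of the original answer.

```python
import csv
import io

# Hypothetical sample in the same shape as all_data:
# [group header, common name, scientific name, French name, status].
rows = [
    ["PASSERIFORMES: Cardinalidae", "Summer Tanager", "Piranga rubra",
     "Piranga vermillon", "Rare/Accidental"],
    ["PASSERIFORMES: Buntings", "Dickcissel", "Spiza americana",
     "Dickcissel d'Amérique", "Rare/Accidental"],
]


def rows_to_csv(rows):
    """Return the rows as CSV text; csv.writer quotes fields when needed."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text, end="")
```

To write to disk instead, the same csv.writer call can be pointed at a file opened with newline="" and an explicit encoding such as utf-8.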


Thanks for the reply! I'm still not sure how to change the output so that the rows look like this: 1 PASSERIFORMES: Cardinalidae Summer Tanager Piranga rubra Piranga vermillon

Ah, I completely missed that BeautifulSoup has a find_previous function! This does answer my question about a solution without pandas (though I'm still curious how it might work!). Thanks a lot!