Python: extracting data from HTML in order
How can I extract all the table contents, together with the data that precedes each table, from a Wikipedia page? For example, the data comes in this repeating format:
<p>
<b>
Order
</b>
:
<a class="mw-redirect" href="/wiki/Passeriformes" title="Passeriformes">
Passeriformes
</a>
<span class="nowrap">
</span>
<b>
Family
</b>
:
<a class="mw-redirect" href="/wiki/Passeridae" title="Passeridae">
Passeridae
</a>
</p>
<p>
<a href="/wiki/Sparrow" title="Sparrow">
Sparrows
</a>
are small passerine birds ...
</p>
<table class="wikitable" width="72%">
<tr>
<th width="24%">
Common name
</th>
<th width="24%">
Binomial
</th>
<th width="24%">
Status
</th>
</tr>
<tr>
<td>
<a href="/wiki/House_sparrow" title="House sparrow">
House sparrow
</a>
</td>
<td>
<i>
Passer domesticus
</i>
</td>
<td>
Trinidad only - Introduced species
</td>
</tr>
</table>
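The desired output format is: Order, Family, Description, Common name, Binomial, Status.
Approach:
All of the wanted tags are siblings of one another, so you basically have to use the find_next_sibling() function to locate them.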
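To illustrate why find_next_sibling() is the right tool here, the following is a minimal sketch on a tiny hand-written fragment (the `html` string is an assumption for illustration, not the real page). It shows that the plain `.next_sibling` attribute often returns a whitespace text node, whereas `find_next_sibling(name)` skips ahead to the next sibling tag with the given name.

```python
from bs4 import BeautifulSoup

# Tiny illustrative fragment (hypothetical, not the actual Wikipedia markup).
html = (
    "<h2>Screamers</h2>\n"
    "<p>Order: Anseriformes</p>\n"
    "<table><tr><td>Horned screamer</td></tr></table>"
)
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

# .next_sibling returns the very next node, here the newline text node.
raw_next = h2.next_sibling

# find_next_sibling(name) skips text nodes and non-matching tags.
next_p = h2.find_next_sibling("p")
next_table = h2.find_next_sibling("table")

print(repr(raw_next))
print(next_p.get_text())
print(next_table.get_text())
```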
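Explanation:
The names (headings) of all the bird types are inside <h2> tags. However, the first <h2> tag is for the Contents box, so skip it.
The order and family are inside a <p> tag that comes after the <h2> tag. You can find it with h2.find_next_sibling('p').
The table with the common name, binomial and status can be found with h2.find_next_sibling('table').
Using all of these, you can print every detail you need. However, you must break out of the loop when you reach the <h2> tag that contains References. This can be done with: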
if h2.find('span', class_='mw-headline').text == 'References':
    break
Code:
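The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not necessarily the exact code that produced the output listed in this answer: it runs on a small inline snippet (`SAMPLE_HTML`, hand-written to mirror the structure in the question) instead of fetching the live page, and `extract_rows` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hand-written sample mirroring the question's structure (an assumption).
SAMPLE_HTML = """
<h2><span class="mw-headline">Contents</span></h2>
<h2><span class="mw-headline">Old World sparrows</span></h2>
<p><b>Order</b>: <a href="/wiki/Passeriformes">Passeriformes</a>
<b>Family</b>: <a href="/wiki/Passeridae">Passeridae</a></p>
<p><a href="/wiki/Sparrow">Sparrows</a> are small passerine birds ...</p>
<table class="wikitable">
<tr><th>Common name</th><th>Binomial</th><th>Status</th></tr>
<tr><td><a href="/wiki/House_sparrow">House sparrow</a></td>
<td><i>Passer domesticus</i></td>
<td>Trinidad only - Introduced species</td></tr>
</table>
<h2><span class="mw-headline">References</span></h2>
"""


def extract_rows(html):
    """Return (heading, order, family, name, binomial, status) tuples."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # Skip the first <h2>, which belongs to the Contents box.
    for h2 in soup.find_all("h2")[1:]:
        headline = h2.find("span", class_="mw-headline")
        title = (headline or h2).get_text(strip=True)
        if title == "References":
            break  # nothing of interest after this heading
        p = h2.find_next_sibling("p")
        table = h2.find_next_sibling("table")
        if p is None or table is None:
            continue
        links = p.find_all("a")
        if len(links) < 2:
            continue  # no Order/Family links in this paragraph
        order = links[0].get_text(strip=True)
        family = links[1].get_text(strip=True)
        # Skip the header row; data rows use <td> cells.
        for tr in table.find_all("tr")[1:]:
            cells = [td.get_text(" ", strip=True) for td in tr.find_all("td")]
            if len(cells) >= 3:
                results.append((title, order, family, *cells[:3]))
    return results


rows = extract_rows(SAMPLE_HTML)
for title, *fields in rows:
    print(title)
    print(" | ".join(fields))
```

One caveat with this sketch: find_next_sibling('table') returns the next table anywhere later among the siblings, so a section without its own table would pick up the table of a following section. On a real page it may be worth checking that the table appears before the next <h2>.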
Partial output:
Tinamous
Tinamiformes | Tinamidae | Little tinamou | Crypturellus soui | Trinidad only
Screamers
Anseriformes | Anhimidae | Horned screamer | Anhima cornuta | Trinidad only - rare/accidental
Ducks, geese, and waterfowl
Anseriformes | Anatidae | Fulvous whistling-duck | Dendrocygna bicolor | Trinidad only
Anseriformes | Anatidae | White-faced whistling-duck | Dendrocygna viduata | Trinidad only - rare/accidental
...
...
Waxbills and allies
Passeriformes | Estrildidae | Common waxbill | Estrilda astrild | Trinidad, accidental Tobago - introduced species
Passeriformes | Estrildidae | Tricolored munia | Lonchura malacca | Trinidad only - introduced species
Old World sparrows
Passeriformes | Passeridae | House sparrow | Passer domesticus | Trinidad only - Introduced species
Beautiful. The written explanation is very good too. Thanks!