How to get the href from <td> cells of a table in Python
I am using Python and BeautifulSoup to scrape a table. I want to extract the URLs from an HTML table with the following content:
<tbody>
<tr>
<td colspan="4" style="height:10px"></td>
</tr>
<tr class="header" id="a">
<td class="w40 hidden-xs hidden-sm hidden-xxs"> </td>
<td>A</td>
<td><a class="fa fa-angle-up goToTop pull-right" href="#" onclick="$('html, body').animate({scrollTop: 0}, 1000);return false;" title="Scroll to top"></a></td>
<td class="w40 hidden-xs hidden-sm hidden-xxs"> </td>
</tr>
<tr>
<td class="w40 hidden-xs hidden-sm hidden-xxs"> </td>
<td colspan="2"><a data-iata="ARD" data-lat="-8.13234" data-lon="124.597" href="https://www.flightradar24.com/data/airports/ard" title="Alor Island Airport"><img class="icon-airport" src="https://www.flightradar24.com/static/images/airport_pin_40_blue.png"/> Alor Island Airport <small>(ARD/WATM)</small> </a> </td>
<td class="w40 hidden-xs hidden-sm hidden-xxs"> </td>
</tr>
<tr>
<td class="w40 hidden-xs hidden-sm hidden-xxs"> </td>
<td colspan="2"><a data-iata="AMQ" data-lat="-3.71026" data-lon="128.089096" href="https://www.flightradar24.com/data/airports/amq" title="Ambon Pattimura Airport"><img class="icon-airport" src="https://www.flightradar24.com/static/images/airport_pin_40_blue.png"/> Ambon Pattimura Airport <small>(AMQ/WAPP)</small> </a> <span class="pull-right">Rating: 79%</span> </td>
<td class="w40 hidden-xs hidden-sm hidden-xxs"> </td>
</tr>
<tr>
<td class="w40 hidden-xs hidden-sm hidden-xxs"> </td>
<td colspan="2"><a data-iata="ABU" data-lat="-9.07444" data-lon="124.904404" href="https://www.flightradar24.com/data/airports/abu" title="Atambua Haliwen Airport"><img class="icon-airport" src="https://www.flightradar24.com/static/images/airport_pin_40_blue.png"/> Atambua Haliwen Airport <small>(ABU/WATA)</small> </a> </td>
<td class="w40 hidden-xs hidden-sm hidden-xxs"> </td>
</tr>
<tr class="header" id="b">
<td class="w40 hidden-xs hidden-sm hidden-xxs"> </td>
<td>B</td>
<td><a class="fa fa-angle-up goToTop pull-right" href="#" onclick="$('html, body').animate({scrollTop: 0}, 1000);return false;" title="Scroll to top"></a></td>
<td class="w40 hidden-xs hidden-sm hidden-xxs"> </td>
</tr>
<tr>
<td class="w40 hidden-xs hidden-sm hidden-xxs"> </td>
<td colspan="2"><a data-iata="BXB" data-lat="-2.53224" data-lon="133.438797" href="https://www.flightradar24.com/data/airports/bxb" title="Babo Airport"><img class="icon-airport" src="https://www.flightradar24.com/static/images/airport_pin_40_blue.png"/> Babo Airport <small>(BXB/WASO)</small> </a> </td>
<td class="w40 hidden-xs hidden-sm hidden-xxs"> </td>
</tr>
<tr>
<td class="w40 hidden-xs hidden-sm hidden-xxs"> </td>
<td colspan="2"><a data-iata="BJW" data-lat="-8.7125" data-lon="121.0625" href="https://www.flightradar24.com/data/airports/bjw" title="Bajawa Turelelo Soa Airport"><img class="icon-airport" src="https://www.flightradar24.com/static/images/airport_pin_40_blue.png"/> Bajawa Turelelo Soa Airport <small>(BJW/WATB)</small> </a> </td>
<td class="w40 hidden-xs hidden-sm hidden-xxs"> </td>
</tr>
<tr>
<td class="w40 hidden-xs hidden-sm hidden-xxs"> </td>
<td colspan="2"><a data-iata="BPN" data-lat="-1.26827" data-lon="116.894402" href="https://www.flightradar24.com/data/airports/bpn" title="Balikpapan Sepinggan Airport"><img class="icon-airport" src="https://www.flightradar24.com/static/images/airpo,...
My code is as follows:
bs = BeautifulSoup(page.content, 'html.parser')
table_body = bs.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    for link in cols:
        a = link.get("href")
        print(a)
But I get:
None
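Is there a way to do this in Python?

You are missing a loop. The href is contained in the a tag, not in the td, so you also need to iterate over the a tags inside each cell.
The code below gives the correct output: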
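from bs4 import BeautifulSoup

# html holds the page markup (for example, page.content from the question)
bs = BeautifulSoup(html, 'html.parser')
table_body = bs.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    for col in cols:
        # the href lives on the <a> tags inside each cell, not on the <td>
        a_list = col.find_all('a')
        for a in a_list:
            href = a.get("href")
            print(href)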
Output:
#
https://www.flightradar24.com/data/airports/ard
https://www.flightradar24.com/data/airports/amq
https://www.flightradar24.com/data/airports/abu
#
https://www.flightradar24.com/data/airports/bxb
https://www.flightradar24.com/data/airports/bjw
https://www.flightradar24.com/data/airports/bpn
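
The "#" entries in that output come from the "Scroll to top" links in the header rows. If they are not wanted, one option is to filter them out. The following is only a minimal sketch: it assumes a shortened stand-in for the table from the question, and the names html, first_td, hrefs and airport_urls are illustrative. It also shows why calling get("href") directly on a td gives None.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (an assumption for this sketch).
html = """
<table><tbody>
<tr class="header" id="a">
  <td>A</td>
  <td><a class="goToTop" href="#" title="Scroll to top"></a></td>
</tr>
<tr>
  <td colspan="2"><a data-iata="ARD" href="https://www.flightradar24.com/data/airports/ard">Alor Island Airport</a></td>
</tr>
<tr>
  <td colspan="2"><a data-iata="AMQ" href="https://www.flightradar24.com/data/airports/amq">Ambon Pattimura Airport</a></td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# A <td> has no href attribute of its own, so .get() returns None here.
first_td = soup.find("td")
print(first_td.get("href"))

# href=True keeps only <a> tags that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find("tbody").find_all("a", href=True)]

# Drop the "#" scroll-to-top anchors from the header rows.
airport_urls = [h for h in hrefs if h != "#"]
for url in airport_urls:
    print(url)
```

Searching from tbody with find_all("a", href=True) walks all descendants, so the explicit loops over tr and td are not needed for this particular task.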
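
To scrape every href from the <td> cells, you can use the CSS selector tbody td a[data-iata][href], which means "every a tag under tbody td that has both a data-iata attribute and an href attribute".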
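
The code for this answer did not survive on this page, so the following is a hedged reconstruction rather than the answerer's original. It is a minimal sketch that assumes a shortened stand-in for the table in the question; the names html, soup and links are illustrative. It passes the selector to BeautifulSoup's select() method.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (an assumption for this sketch).
html = """
<table><tbody>
<tr class="header" id="a">
  <td>A</td>
  <td><a class="goToTop" href="#" title="Scroll to top"></a></td>
</tr>
<tr>
  <td colspan="2"><a data-iata="ARD" href="https://www.flightradar24.com/data/airports/ard">Alor Island Airport</a></td>
</tr>
<tr>
  <td colspan="2"><a data-iata="AMQ" href="https://www.flightradar24.com/data/airports/amq">Ambon Pattimura Airport</a></td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns every element matching the CSS selector.
# The [data-iata] part excludes the "Scroll to top" links, which have no such attribute.
links = [a["href"] for a in soup.select("tbody td a[data-iata][href]")]
for link in links:
    print(link)
```

Because the header-row anchors carry no data-iata attribute, the "#" links are skipped without any extra filtering. With the complete table from the question in html, the same selector should match each airport link.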
Output:
https://www.flightradar24.com/data/airports/ard
https://www.flightradar24.com/data/airports/amq
https://www.flightradar24.com/data/airports/abu
https://www.flightradar24.com/data/airports/bxb
https://www.flightradar24.com/data/airports/bjw
https://www.flightradar24.com/data/airports/bpn