Python: pull the links from a table, follow each link & scrape the data
I have a table from which I want to pick up all the links, go through each link, and scrape the items in td class="horse". The main page containing the table of links has the following code:
<table border="0" cellspacing="0" cellpadding="0" class="full-calendar">
<tr>
<th width="160"> </th>
<th width="105"><a href="/FreeFields/Calendar.aspx?State=NSW">NSW</a></th>
<th width="105"><a href="/FreeFields/Calendar.aspx?State=VIC">VIC</a></th>
<th width="105"><a href="/FreeFields/Calendar.aspx?State=QLD">QLD</a></th>
<th width="105"><a href="/FreeFields/Calendar.aspx?State=WA">WA</a></th>
<th width="105"><a href="/FreeFields/Calendar.aspx?State=SA">SA</a></th>
<th width="105"><a href="/FreeFields/Calendar.aspx?State=TAS">TAS</a></th>
<th width="105"><a href="/FreeFields/Calendar.aspx?State=ACT">ACT</a></th>
<th width="105"><a href="/FreeFields/Calendar.aspx?State=NT">NT</a></th>
</tr>
<tr class="rows">
<td>
<p><span>FRIDAY 13 JAN</span></p>
</td>
<td>
<p>
<a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Ballina">Ballina</a><br>
<a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Gosford">Gosford</a><br>
</p>
</td>
<td>
<p>
<a href="/FreeFields/Form.aspx?Key=2017Jan13,VIC,Ararat">Ararat</a><br>
<a href="/FreeFields/Form.aspx?Key=2017Jan13,VIC,Cranbourne">Cranbourne</a><br>
</p>
</td>
<td>
<p>
<a href="/FreeFields/Form.aspx?Key=2017Jan13,QLD,Doomben">Doomben</a><br>
</p>
</td>
Wondering if anyone can help me with how to get the code to click all the links in the table & carry out the following on each page:

g_data = soup.find_all("td", {"class": "horse"})
for item in g_data:
    print(item.text)

Thanks in advance
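As a quick sanity check, the td/a extraction can be exercised offline against the snippet above (a sketch; the HTML is embedded as a trimmed string here rather than fetched, and only the meeting links under /FreeFields/Form.aspx are kept, not the state headers):

```python
import re
from bs4 import BeautifulSoup

# A trimmed copy of the calendar snippet from the question.
html = """
<table class="full-calendar">
  <tr class="rows">
    <td><p><span>FRIDAY 13 JAN</span></p></td>
    <td><p>
      <a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Ballina">Ballina</a><br>
      <a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Gosford">Gosford</a><br>
    </p></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Match only meeting-form links, skipping the state-calendar headers.
links = [a["href"] for a in soup.find_all("a", href=re.compile(r"^/FreeFields/Form"))]
print(links)
```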
bs4 + requests can do what you need:

import requests, bs4, re
from urllib.parse import urljoin

start_url = 'http://www.racingaustralia.horse/'

def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup

def get_links(url):
    soup = make_soup(url)
    a_tags = soup.find_all('a', href=re.compile(r"^/FreeFields/"))
    links = [urljoin(start_url, a['href']) for a in a_tags]  # convert relative url to absolute url
    return links

def get_tds(link):
    soup = make_soup(link)
    tds = soup.find_all('td', class_="horse")
    if not tds:
        print(link, 'do not find hours tag')
    else:
        for td in tds:
            print(td.text)

if __name__ == '__main__':
    links = get_links(start_url)
    for link in links:
        get_tds(link)
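One easy refinement: the state calendar pages repeat many of the same Form.aspx links as the home page, so the scraper above can fetch the same URL more than once. Deduplicating while preserving first-seen order is a one-liner (a sketch on plain strings; in the script above it would wrap the list returned by get_links):

```python
def dedupe(links):
    """Drop repeated URLs, keeping first-seen order.
    dict preserves insertion order in Python 3.7+."""
    return list(dict.fromkeys(links))

urls = [
    "http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=NSW",
    "http://www.racingaustralia.horse/FreeFields/Form.aspx?Key=2017Jan13,NSW,Ballina",
    "http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=NSW",
]
print(dedupe(urls))
```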
Comments:

What do you mean by "click the links"? Do you mean go to each linked page and scrape all of them? — Yes; the table contains data such as FRIDAY 13 JAN.

@KirstyDent please put any relevant data, such as the HTML in the comment above, into the question itself, to make it easier for future readers to find. — Sorry, I'll do that now!

How would I add pagination to this code, where the home page has multiple pages?
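On the pagination question in the comments: the site's paging markup isn't shown anywhere above, so the selector below is a hypothetical example (an `<a rel="next">` anchor); the idea is to scrape a page, look for a next-page link, and keep following it until none is found:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def next_page_url(html, current_url):
    """Return the absolute URL of the next-page link, or None on the last page.
    Assumes a hypothetical <a rel="next"> anchor -- adjust to the real markup."""
    soup = BeautifulSoup(html, "html.parser")
    a = soup.find("a", rel="next")
    return urljoin(current_url, a["href"]) if a else None

# Demo on a stand-in page:
page = '<a rel="next" href="/FreeFields/Calendar.aspx?Page=2">Next</a>'
print(next_page_url(page, "http://www.racingaustralia.horse/FreeFields/Calendar.aspx"))
```

A crawl loop would then repeat `html = requests.get(url).text; get_tds(url); url = next_page_url(html, url)` until `url` is None.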
Output:
http://www.racingaustralia.horse/FreeFields/GroupAndListedRaces.aspx do not find hours tag
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=NSW do not find hours tag
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=VIC do not find hours tag
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=QLD do not find hours tag
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=WA do not find hours tag
.......
WEARETHECHAMPIONS
STORMY HORIZON
OUR RED JET
SAPPER TOM
MY COUSIN BOB
ALL TOO HOT
SAGA DEL MAR
ZIGZOFF
SASHAY AWAY
SO SHE IS
MILADY DUCHESS