Python: How to find all "next" links with BeautifulSoup
I am currently scraping all pages of a particular website by presetting a variable called number_of_pages. Presetting this variable worked until pages I did not know about were added. For example, the code below assumes 3 pages, but the site now has 4:
base_url = 'https://securityadvisories.paloaltonetworks.com/Home/Index/?page='
number_of_pages = 3
for i in range(1, number_of_pages, 1):
    url_to_scrape = (base_url + str(i))
I want to use BeautifulSoup to find all the "next" links on the site. The code below finds the second URL, but not the third or fourth. How can I build a list of all the pages before scraping them?
import requests
from bs4 import BeautifulSoup

base_url = 'https://securityadvisories.paloaltonetworks.com/Home/Index/?page='
CrawlRequest = requests.get(base_url)
raw_html = CrawlRequest.text
linkSoupParser = BeautifulSoup(raw_html, 'html.parser')
page = linkSoupParser.find('div', {'class': 'pagination'})
for list_of_links in page.find('a', href=True, text='next'):
    nextURL = 'https://securityadvisories.paloaltonetworks.com' + list_of_links.parent['href']
    print(nextURL)
There are several different ways to implement pagination; here is one of them. The idea is to start an endless loop and break out of it once there is no "next" link. Executed against this site, it prints the messages shown below.
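That loop might be sketched as follows. This is a sketch, not the answer's exact code: it assumes the pagination markup shown in the question (a div with class "pagination" containing an anchor whose text is exactly "next"), and the helper names find_next_href and scrape_all_pages are made up for illustration:

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://securityadvisories.paloaltonetworks.com'


def find_next_href(soup):
    # Look for the "next" anchor inside the pagination block and
    # return its href, or None when we are on the last page.
    pagination = soup.find('div', {'class': 'pagination'})
    if pagination is None:
        return None
    link = pagination.find('a', href=True, string='next')
    return link['href'] if link else None


def scrape_all_pages():
    url = BASE_URL + '/Home/Index/?page='
    page_number = 1
    # A Session reuses the TCP connection and keeps cookies between requests.
    with requests.Session() as session:
        while True:  # endless loop, broken once there is no "next" link
            print(f'Processing page: #{page_number}; url: {url}')
            soup = BeautifulSoup(session.get(url).text, 'html.parser')
            next_href = find_next_href(soup)
            if next_href is None:
                break
            url = BASE_URL + next_href
            page_number += 1
    print('Done.')
```

Because the stopping condition comes from the page itself, this keeps working when a fifth page appears, with no number_of_pages variable to maintain.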
Processing page: #1; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=
Processing page: #2; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=2
Processing page: #3; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=3
Processing page: #4; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=4
Done.
Note that, to improve performance and to persist cookies between requests, we maintain a web-scraping session using requests.Session.
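For illustration, a minimal session can be set up like this (the User-Agent string is a made-up placeholder, not something the site requires):

```python
import requests

# A Session keeps one connection pool open and automatically carries
# cookies and default headers across every request made through it.
session = requests.Session()
session.headers.update({'User-Agent': 'advisory-scraper/1.0'})  # placeholder UA
session.cookies.set('example', 'persisted')  # sent with later session.get() calls

print(session.headers['User-Agent'])   # advisory-scraper/1.0
print(session.cookies.get('example'))  # persisted
```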