Python: scraping a website's links, and also scraping the links inside the already-scraped links
I am trying to scrape the links from a website. After scraping them, I also want to check whether each scraped link is just an article or contains further links, and if so, scrape those links as well. I am trying to do this with BeautifulSoup 4. This is my code so far:
import requests
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0'  # assumed; not defined in the original snippet
url = 'https://www.lbbusinessjournal.com/'
try:
    r = requests.get(url, headers={'User-Agent': user_agent})
    soup = BeautifulSoup(r.text, 'html.parser')
    for post in soup.find_all(['h3', 'li'], class_=['entry-title td-module-title', 'menu-item']):
        link = post.find('a').get('href')
        print(link)
        r = requests.get(link, headers={'User-Agent': user_agent})
        soup1 = BeautifulSoup(r.text, 'html.parser')
        for post1 in soup1.find_all('h3', class_='entry-title td-module-title'):
            link1 = post1.find('a').get('href')
            print(link1)
except Exception as e:
    print(e)
I want the links on the page, and then also to scrape any links that may exist inside the pages those links point to. So far I am only getting links from the main page.

Try re-raising from the except clause and you will see the error

AttributeError: 'NoneType' object has no attribute 'get'

originating from the line link1 = post1.find('a').get('href'), where post1.find('a') returns None. This happens because at least one of the h3 elements you retrieve contains no a element; in fact, that link appears to be commented out in the HTML. Instead, you should split the post1.find('a').get('href') call into two steps and check whether the element returned by post1.find('a') is None before trying to get the 'href' attribute, i.e.:
for post1 in soup1.find_all('h3', class_='entry-title td-module-title'):
    element = post1.find('a')
    if element is not None:
        link1 = element.get('href')
        print(link1)
Output from running the code with this change:
https://www.lbbusinessjournal.com/
https://www.lbbusinessjournal.com/this-virus-doesnt-have-borders-port-official-warns-of-pandemics-future-economic-impact/
https://www.lbbusinessjournal.com/pharmacy-and-grocery-store-workers-call-for-increased-protections-against-covid-19/
https://www.lbbusinessjournal.com/up-close-and-personal-grooming-businesses-struggle-in-times-of-social-distancing/
https://www.lbbusinessjournal.com/light-at-the-end-of-the-tunnel-long-beach-secures-contract-for-new-major-convention/
https://www.lbbusinessjournal.com/hospitals-prepare-for-influx-of-coronavirus-patients-officials-worry-it-wont-be-enough/
https://www.lbbusinessjournal.com/portside-keeping-up-with-the-port-of-long-beach-18/
https://www.lbbusinessjournal.com/news/
...
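To see the None check working in isolation, without hitting the network, here is a self-contained sketch against a small inline HTML fragment. The markup below is hypothetical: it mimics the page's structure, with one link commented out the way the problematic link appears to be on the real site:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: one h3 with a normal link, and one h3 whose
# link is commented out, so post.find('a') returns None for it.
html = """
<h3 class="entry-title td-module-title"><a href="https://example.com/post-1">Post 1</a></h3>
<h3 class="entry-title td-module-title"><!-- <a href="https://example.com/post-2">Post 2</a> --></h3>
"""

soup = BeautifulSoup(html, 'html.parser')
links = []
for post in soup.find_all('h3', class_='entry-title td-module-title'):
    element = post.find('a')   # may be None when the <a> is missing or commented out
    if element is not None:
        links.append(element.get('href'))

print(links)  # only the uncommented link survives
```

Note that html.parser turns the commented-out anchor into a Comment node, so find('a') does not match it; without the None check, the second h3 would raise the AttributeError seen above.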
Comment: So what is the current output when you run your code? Does it not find the links on the page?
Reply: It does not give me the links inside the links, as I asked. @dspencer If I print post1 I can see the link inside it, but if I try to get the 'href' attribute from the element, it prints nothing.
Comment: FYI, it's scrape (and scraping, scraped, scraper), not scrap.