Python – scraping a website's links, and also scraping the links inside the already-scraped links


I am trying to scrape a website's links, and after scraping I also want to check whether each scraped link is just an article or contains more links; if it does, I want to scrape those links too. I am trying to do this with BeautifulSoup 4. This is my code so far:

import requests
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0'  # any browser-like User-Agent string; undefined in the original snippet
url = 'https://www.lbbusinessjournal.com/'
try:
    r = requests.get(url, headers={'User-Agent': user_agent})
    soup = BeautifulSoup(r.text, 'html.parser')
    for post in soup.find_all(['h3', 'li'], class_=['entry-title td-module-title', 'menu-item']):
        link = post.find('a').get('href')
        print(link)
        r = requests.get(link, headers={'User-Agent': user_agent})
        soup1 = BeautifulSoup(r.text, 'html.parser')
        for post1 in soup1.find_all('h3', class_='entry-title td-module-title'):
            link1 = post1.find('a').get('href')
            print(link1)
except Exception as e:
    print(e)

I want the links on the page, and then to also scrape any links that may exist inside the links I got from that page. So far I only get the links from the homepage.

Try `raise`-ing from your except clause and you will see the error

AttributeError: 'NoneType' object has no attribute 'get'

originating from the line link1 = post1.find('a').get('href'), where post1.find('a') returned None. This is because at least one of the h3 elements in the HTML you retrieved has no a element inside it — in fact, that link appears to be commented out in the HTML.

Instead, you should split the post1.find('a').get('href') call into two steps, and check whether the element returned by post1.find('a') is None before trying to get the 'href' attribute, i.e.:

for post1 in soup1.find_all('h3', class_='entry-title td-module-title'):
    element = post1.find('a')
    if element is not None:
        link1 = element.get('href')
        print(link1)
Output of running the code with this change:

https://www.lbbusinessjournal.com/
https://www.lbbusinessjournal.com/this-virus-doesnt-have-borders-port-official-warns-of-pandemics-future-economic-impact/
https://www.lbbusinessjournal.com/pharmacy-and-grocery-store-workers-call-for-increased-protections-against-covid-19/
https://www.lbbusinessjournal.com/up-close-and-personal-grooming-businesses-struggle-in-times-of-social-distancing/
https://www.lbbusinessjournal.com/light-at-the-end-of-the-tunnel-long-beach-secures-contract-for-new-major-convention/
https://www.lbbusinessjournal.com/hospitals-prepare-for-influx-of-coronavirus-patients-officials-worry-it-wont-be-enough/
https://www.lbbusinessjournal.com/portside-keeping-up-with-the-port-of-long-beach-18/
https://www.lbbusinessjournal.com/news/
...
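The claim that the missing anchor is commented out can be reproduced in isolation. A minimal sketch with made-up HTML (the example.com URLs and markup are illustrative, not taken from the site): a commented-out a element is parsed as a Comment node, not a Tag, so find('a') returns None for it.

```python
from bs4 import BeautifulSoup

# Two h3 elements with the classes the site uses: one with a real anchor,
# one whose anchor is commented out (hypothetical markup for illustration).
html = """
<h3 class="entry-title td-module-title"><a href="https://example.com/post-1">Post 1</a></h3>
<h3 class="entry-title td-module-title"><!-- <a href="https://example.com/hidden">Hidden</a> --></h3>
"""

soup = BeautifulSoup(html, "html.parser")
links = []
for post in soup.find_all("h3", class_="entry-title td-module-title"):
    a = post.find("a")  # None for the second h3: commented-out markup is not a Tag
    if a is not None:
        links.append(a.get("href"))

print(links)  # → ['https://example.com/post-1']
```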

So what is the current output when you run your code? Does it not find the links on the page?

It doesn't give the links inside the links, as I asked.

@dspencer If I print post1 then I can see the link inside it, but if I try to get the 'href' attribute from the element, it doesn't print anything.

FYI, it's scrape (and scraping, scraped, scraper), not scrap.
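As for the follow-up in the comments — getting the links inside the scraped links — one approach is a small depth-limited crawl with a visited set, so the same page is not fetched twice. This is a sketch, not the answerer's code; the fetch function is injected so it can be demonstrated without the network, using a hypothetical two-page site:

```python
from bs4 import BeautifulSoup

def extract_links(html):
    """Pull hrefs from the h3/li elements the question's selectors target."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for post in soup.find_all(["h3", "li"],
                              class_=["entry-title td-module-title", "menu-item"]):
        a = post.find("a")
        if a is not None and a.get("href"):  # guard against missing anchors
            links.append(a["href"])
    return links

def crawl(url, fetch, depth=2, seen=None):
    """Follow links up to `depth` levels deep, skipping already-visited pages."""
    if seen is None:
        seen = set()
    if depth == 0 or url in seen:
        return []
    seen.add(url)
    found = []
    for link in extract_links(fetch(url)):
        found.append(link)
        found.extend(crawl(link, fetch, depth - 1, seen))
    return found

# Hypothetical two-page site standing in for requests.get(url).text:
pages = {
    "https://site.test/": '<h3 class="entry-title td-module-title">'
                          '<a href="https://site.test/article">Article</a></h3>',
    "https://site.test/article": '<li class="menu-item">'
                                 '<a href="https://site.test/">Home</a></li>',
}

print(crawl("https://site.test/", pages.get, depth=2))
# → ['https://site.test/article', 'https://site.test/']
```

Against the real site you would pass something like lambda u: requests.get(u, headers={'User-Agent': user_agent}).text as the fetch argument; the depth limit and visited set keep the crawl from looping between pages that link to each other.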