Python 3.x: How do I extract all page URLs from a website using Python 3?


I want a list of the URLs of every page on a website. The following code returns nothing:

from bs4 import BeautifulSoup
import requests

base_url = 'http://www.techadvisorblog.com'
response = requests.get(base_url + '/a')
soup = BeautifulSoup(response.text, 'html.parser')

urls = []

for tr in soup.select('tbody tr'):
    urls.append(base_url + tr.td.a['href'])
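
A quick way to see why the list stays empty is to print the HTTP status of the response before parsing; a minimal check along these lines (plain requests, nothing site-specific assumed):

import requests

base_url = 'http://www.techadvisorblog.com'
response = requests.get(base_url + '/a')

# Without a browser-like User-Agent the server rejects the request,
# so the HTML never arrives and soup.select() has nothing to match.
print(response.status_code)  # 406 (Not Acceptable) without a User-Agent header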

The response from the server is 406 (Not Acceptable). You can get past this by specifying a User-Agent header:

>>> response = requests.get(base_url + '/a', headers={"User-Agent": "XY"})

Rebuild soup from that new response and you can get the URLs:

>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
#content
https://techadvisorblog.com/
https://techadvisorblog.com
https://techadvisorblog.com/contact-us/
https://techadvisorblog.com/about-us/
https://techadvisorblog.com/disclaimer/
https://techadvisorblog.com/privacy-policy/
None
https://techadvisorblog.com/
https://techadvisorblog.com
https://techadvisorblog.com/contact-us/
https://techadvisorblog.com/about-us/
https://techadvisorblog.com/disclaimer/
https://techadvisorblog.com/privacy-policy/
None
https://techadvisorblog.com/
https://www.instagram.com/techadvisorblog
//www.pinterest.com/pin/create/button/?url=https://techadvisorblog.com/about-us/
https://techadvisorblog.com/contact-us/
https://techadvisorblog.com/
https://techadvisorblog.com/what-is-world-wide-web-www/
https://techadvisorblog.com/best-free-password-manager-for-windows-10/
https://techadvisorblog.com/solved-failed-to-start-emulator-the-emulator-was-not-properly-closed/
https://techadvisorblog.com/is-telegram-safe/
https://techadvisorblog.com/will-technology-ever-rule-the-world/
https://techadvisorblog.com/category/android/
https://techadvisorblog.com/category/knowledge/basic-computer/
https://techadvisorblog.com/category/games/
https://techadvisorblog.com/category/knowledge/
https://techadvisorblog.com/category/security/
http://Techadvisorblog.com/
http://Techadvisorblog.com
None
None
None
None
None
>>>
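
Putting the two steps together, here is a minimal sketch that collects the unique absolute links from the page (the soup must be built from the response that carried the User-Agent header; skipping empty and in-page fragment links is an assumption about what counts as a page URL for you):

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

base_url = 'https://techadvisorblog.com'
headers = {'User-Agent': 'Mozilla/5.0'}

response = requests.get(base_url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

urls = set()
for link in soup.find_all('a'):
    href = link.get('href')
    if not href or href.startswith('#'):
        continue  # skip anchors without an href and in-page fragments
    # urljoin resolves relative and protocol-relative links against the base URL
    urls.add(urljoin(base_url, href))

for url in sorted(urls):
    print(url)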

I don't know why you are concatenating /a onto the end of the URL, since that simply redirects to the "about us" page. I also don't see any table/tr/td markup on either the base URL or the about-us page, so your selector can never match. Instead, if you want to loop through the two (or more) pages that paginate the base URL, you can do so by testing for the presence of a rel attribute with the value next. And yes, you need a valid User-Agent header.

import requests
from bs4 import BeautifulSoup as bs

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36',
}

page = 1
with requests.Session() as s:
    s.headers = headers
    while True:
        r = s.get(f'https://techadvisorblog.com/page/{page}/')
        soup = bs(r.content, 'lxml')
        print(soup.select_one('title').text)
        # stop once there is no rel="next" link, i.e. this is the last page
        if soup.select_one('[rel=next]') is None:
            break
        page += 1
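
If the goal is the full list of URLs rather than just each page's title, the same loop can collect the links from every paginated page before moving on; a sketch along those lines (collecting every href on each archive page is an assumption, narrow the selector if you only want post links):

import requests
from bs4 import BeautifulSoup as bs

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36',
}

urls = set()
page = 1
with requests.Session() as s:
    s.headers = headers
    while True:
        r = s.get(f'https://techadvisorblog.com/page/{page}/')
        soup = bs(r.content, 'lxml')
        # collect every link that actually has an href on this archive page
        for link in soup.select('a[href]'):
            urls.add(link['href'])
        # stop once there is no rel="next" link pointing to a further page
        if soup.select_one('[rel=next]') is None:
            break
        page += 1

print(len(urls), 'unique links found')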

Can you show part of the desired output? And why concatenate "/a", which just redirects to https://techadvisorblog.com/about-us/?