Python requests error 400: "Your browser sent an invalid request"

python web-crawler


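I have very limited knowledge of web crawling/scraping and am trying to create a web crawler that points to this URL. However, when I try to print the response text from the server, I get the following result: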

<html><body><h1>400 Bad request</h1>
Your browser sent an invalid request.
</body></html>

Try using BeautifulSoup together with a User-Agent header that makes your request look like it comes from a real browser:

import requests
from bs4 import BeautifulSoup

URL = "https://phys.org/rss-feed/"
# Browser-like User-Agent, sent instead of the default one used by requests
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
headers = {"user-agent": USER_AGENT}

resp = requests.get(URL, headers=headers)
# The "lxml" parser requires the lxml package to be installed
soup = BeautifulSoup(resp.content, "lxml")
print(soup)

Setting the header alone also works, without BeautifulSoup:

import requests
URL='https://phys.org/rss-feed/'
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)
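
To see what the header actually changes, it can help to inspect the request before it is sent. By default, requests identifies itself with a User-Agent of the form `python-requests/<version>`, and some servers reject that. Whether this is the exact reason for the 400 on this site is an assumption, not something the server documents. The sketch below builds the request without contacting the server; the helper name `prepared_user_agent` is made up for illustration.

```python
import requests

URL = "https://phys.org/rss-feed/"
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"


def prepared_user_agent(url, headers=None):
    """Return the User-Agent requests would send, without making a network call.

    Hypothetical helper: it only prepares the request, it never sends it.
    """
    with requests.Session() as session:
        request = requests.Request("GET", url, headers=headers)
        prepared = session.prepare_request(request)
    return prepared.headers["User-Agent"]


# What the server sees when only the URL is passed
default_ua = prepared_user_agent(URL)
# What the server sees with the browser-like header from the answer
masked_ua = prepared_user_agent(URL, {"user-agent": USER_AGENT})

print("default:", default_ua)
print("masked: ", masked_ua)
```

Header names are matched case-insensitively, which is why the lowercase `"user-agent"` key replaces the default `User-Agent` value.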

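Thank you! I have been stuck on this for days. But could you explain why it does not work with just the URL?

I think the server blocks bots. Most domains have a robots.txt that states their restrictions on bots.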
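
As a rough illustration of the robots.txt remark: the file is advisory, so it tells well-behaved crawlers what they may fetch but does not by itself produce a 400 response. A minimal sketch of checking such rules with the standard library follows. The sample rules, the `example.com` URLs, and the `is_allowed` helper are invented for illustration and are not the real rules of any site.

```python
from urllib.robotparser import RobotFileParser

# Invented sample rules, NOT taken from a real site
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Check whether robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no download is needed
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


feed_allowed = is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/rss-feed/")
private_allowed = is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/private/page")

print("feed allowed:", feed_allowed)
print("private allowed:", private_allowed)
```

For a real site, `RobotFileParser.set_url()` followed by `read()` downloads the live robots.txt instead of using an inline string.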
