Python requests cannot retrieve a page

I am trying to retrieve a page using Beautiful Soup.

This is the code I tried:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.nasdaq.com/market-activity/stocks/msft/news-headlines")

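Every time I run the code, it hangs and never retrieves the page. I did, however, get a ReadTimeout exception once:

requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.nasdaq.com', port=443): Read timed out. (read timeout=None)

Any help or a fix for this problem would be greatly appreciated.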

Instead of doing this:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.nasdaq.com/market-activity/stocks/msft/news-headlines")

Try retrieving the web page this way:

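from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# Request only builds the request object; urlopen actually sends it.
req = Request("https://www.nasdaq.com/market-activity/stocks/msft/news-headlines")
page = urlopen(req, timeout=10).read()
soup = BeautifulSoup(page, "html.parser")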

I included headers in my request and it seemed to work. I used the same headers that my browser sends, which you can find using the browser's developer tools.


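This code tries to read the URL with requests, not BeautifulSoup. The site you are requesting seems either to send a very large amount of data or never to actually close the connection, which causes the ReadTimeout you mention, or simply a hang. I'm not sure there is a solution, but I'm confident that researching something like "why does requests.get hang" will turn up some useful results.

I suspect only the user-agent part of the headers is needed, e.g. "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36". You could try including only the user-agent field in the headers.

I second @zmike's comment. Websites often require a "legit" User-Agent header. I say legit because some sites reject requests that do not carry a particularly "normal" User-Agent header, which is of course completely legitimate, albeit odd.
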
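The "(read timeout=None)" in the traceback means no timeout was passed, so requests.get will wait indefinitely if the server never answers. A minimal sketch of passing an explicit timeout and catching the exception follows; the helper name fetch and the (5, 15) second connect/read values are assumptions chosen for illustration, not something from the question or the answers.

```python
import requests


def fetch(url, headers=None, timeout=(5, 15)):
    """Return the response body as text, or None if the request timed out.

    `timeout` is a (connect, read) tuple in seconds; the values here are
    illustrative assumptions, not recommendations from the thread.
    """
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
    except requests.exceptions.Timeout:
        # Base class of both ConnectTimeout and ReadTimeout.
        return None
    # Raise for 4xx/5xx responses instead of silently returning an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch("https://www.nasdaq.com/market-activity/stocks/msft/news-headlines") would then return None after roughly the read timeout instead of hanging forever. This only bounds the wait; it does not by itself make the server respond.
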
import requests

headers = {
    "authority": "www.nasdaq.com",
    "method": "GET",
    "path": "/market-activity/stocks/msft/news-headlines",
    "scheme": "https",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-CA,en;q=0.9,ro-RO;q=0.8,ro;q=0.7,en-GB;q=0.6,en-US;q=0.5",
    "cache-control": "max-age=0",
    "dnt": "1",
    "if-modified-since": "Tue, 30 Jun 2020 19:43:05 GMT",
    "if-none-match": "1593546185",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
}

page = requests.get("https://www.nasdaq.com/market-activity/stocks/msft/news-headlines", headers=headers)
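
To try the suggestion from the comments (sending only the User-Agent), one option is a requests.Session that carries that single header. This is a sketch under assumptions: make_session is a hypothetical helper name, and whether the site accepts a request with only this header is untested here.

```python
import requests

# User-Agent string copied from the header dictionary above.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
)


def make_session(user_agent=USER_AGENT):
    """Return a Session that sends the given User-Agent on every request."""
    session = requests.Session()
    # Overrides the default "python-requests/x.y.z" User-Agent.
    session.headers.update({"User-Agent": user_agent})
    return session
```

A request would then look like make_session().get(url, timeout=10), where url is the page address from the question. If that fails while the full header set works, adding the remaining headers back one at a time should show which ones the site actually checks.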