I am unable to fetch web data from a given website using Python

python web-scraping

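Hi, I am trying to scrape data from a website. I want all the city names, and then I want to scrape the data behind each city's link. However, using the requests library in Python runs into a problem: some session, cookie, or other mechanism is stopping the crawl. Please help.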

>>> import requests
>>> url = 'https://health.usnews.com/doctors/city-index/new-jersey'
>>> html_content = requests.get(url)
>>> html_content.status_code
403
>>> html_content.content
'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http&#58;&#47;&#47;health&#46;usnews&#46;com&#47;doctors&#47;city&#45;index&#47;new&#45;jersey" on this server.<P>\nReference&#32;&#35;18&#46;7d70b17&#46;1528874823&#46;3fac5589\n</BODY>\n</HTML>\n'
>>>


This is the error I get.

You need to add headers to your request so that the site thinks you are a real user:

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
html_content = requests.get(url, headers=headers)
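
A User-Agent on its own may not always be enough, so here is a minimal sketch that sends a few more browser-like headers and fails loudly on a 403 instead of treating the "Access Denied" page as content. The names `browser_headers` and `fetch_page` are hypothetical helpers, not part of requests, and the extra `Accept` and `Accept-Language` values are assumptions about what a desktop browser typically sends. Nothing here guarantees that the site will accept the request.

```python
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
)


def browser_headers(user_agent=DEFAULT_USER_AGENT):
    """Return headers resembling a desktop browser (hypothetical helper)."""
    return {
        "User-Agent": user_agent,
        # Assumed typical browser values; adjust to match a real browser.
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
    }


def fetch_page(session, url, timeout=10):
    """Fetch `url` through `session` and return the response body as text.

    `session` is expected to be a requests.Session, or any object with a
    compatible get() method. raise_for_status() raises on 4xx/5xx codes,
    so a 403 surfaces as an exception.
    """
    response = session.get(url, headers=browser_headers(), timeout=timeout)
    response.raise_for_status()
    return response.text
```

With requests installed, this would be called as `fetch_page(requests.Session(), url)`. A session also keeps any cookies the server sets, which matters if the site checks them on later requests.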

First, as the previous answer suggested, I recommend adding a header to your code, so your code should look like this:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Firefox/60.0'}
url = 'https://health.usnews.com/doctors/city-index/new-jersey'
html_content = requests.get(url, headers=headers)
print(html_content.status_code)  # 200 means the request was accepted; 403 means it was still refused
print(html_content.text)
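
Once the request returns 200, the city names and links the question asks for could be pulled out of `html_content.text`. The thread does not show the page's markup, so the sketch below simply collects every anchor with an `href` using the standard library. `LinkCollector` and `extract_links` are hypothetical names, and filtering the result down to the city links is left to the reader.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (link text, absolute URL) pairs for every <a href=...> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # href of the anchor currently being read
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            # Resolve relative hrefs against the page URL.
            self.links.append((text, urljoin(self.base_url, self._href)))
            self._href = None


def extract_links(html, base_url):
    """Return a list of (text, absolute_url) tuples found in `html`."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

It would be called as `extract_links(html_content.text, url)`. Each returned URL can then be fetched in turn, which covers the "scrape again from the links" step of the question.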

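I have tried with the header as well, and even that does not work. It has the default header User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.100 Safari/537.36, and even that does not work.

It works for me. Maybe you have made too many calls and the site has blocked your IP. You can try from a different IP or server. They may also apply a temporary IP block if you make too many calls within some period of time. In the latter case, you can add some delay in your code.

On rechecking, it seems that popular server IPs are blocked. I tried running from Heroku and AWS and got these headers in the response:

{'Content-Length': '312', 'Expires': 'Wed, Jun 2018 10:45:28 GMT', 'Server': 'AkamaiGHost', 'Connection': 'close', 'Cache-Control': 'max-age=0', 'Date': 'Wed, Jun 2018 10:45:28 GMT', 'Content-Type': 'text/html', 'Mime-Version': '1.0'}

Look at the Server value in the response, 'AkamaiGHost'. You will have to find a way to fool Akamai's request filter here.

Are you running this code from your computer or from a server?

From my computer.

Hmm, strange. Do you still have the problem?

Yes, I also tried using lib.

Have you tried using a proxy?

When you visit the website through a proxy, does it still look okay?

I also used this code, but I get the same error.

No, I just posted with the headers I found in the web page using inspect.

Can you just copy the code I posted and tell me whether you get the same error?

@rofelia09 The header you used is different from the one I put in my answer (I tried using your header and got the same error as you).

But using your header, I also get the same error.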
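
The two suggestions from the comments, spacing out requests and recognising the Akamai block, can be sketched roughly as follows. `looks_blocked` and `fetch_all` are hypothetical helpers, the two-second default delay is an arbitrary assumption, and the check is only a heuristic based on the 403 status and the 'AkamaiGHost' Server header quoted above.

```python
import time


def looks_blocked(status_code, headers):
    """Heuristic: a 403 whose Server header mentions Akamai.

    `headers` may be a plain dict (case-sensitive keys) or the
    case-insensitive headers mapping of a requests response.
    """
    server = headers.get("Server", "")
    return status_code == 403 and "akamai" in server.lower()


def fetch_all(fetch, urls, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive calls.

    `fetch` must return a (status_code, headers, body) tuple. The loop
    stops at the first response that looks blocked, since hammering the
    site further is unlikely to help. Returns {url: body} for the pages
    fetched before that point.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # be polite: wait before every request after the first
        status_code, headers, body = fetch(url)
        if looks_blocked(status_code, headers):
            break
        pages[url] = body
    return pages
```

With requests, `fetch` could wrap `requests.get(url, headers=headers)` and return `(r.status_code, r.headers, r.text)`. Passing `proxies={'http': ..., 'https': ...}` to `requests.get` is the usual way to route the request through a proxy, as one commenter suggested.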