Python crawler request to a URL returns a 404 error, although the URL works in the browser

python web-crawler

I have a Python crawling script that fails on one URL:

pulsepoint.com/sellers.json

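The bot uses a standard request to fetch the content, but gets a 404 error back. In the browser the URL works (there is a 301 redirect, but the request can follow it). My first hunch was that this could be a request header issue, so I copied my browser's configuration. The code looks like this: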

        import logging
        import requests

        # Make logging.info() output visible
        logging.basicConfig(level=logging.INFO)

        crawled_url = "pulsepoint.com"
        seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
        print(seller_json_url)
        # Headers copied from my Firefox session
        myheaders = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
                'Accept-Encoding': 'gzip, deflate, br',
                'Connection': 'keep-alive',
                'Pragma': 'no-cache',
                'Cache-Control': 'no-cache'
            }
        r = requests.get(seller_json_url, headers=myheaders)
        logging.info("  %d", r.status_code)

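But I still get a 404 error.

My next guesses were:

Login? Not used here.
Cookies? None that I can see.

So how is their server blocking my bot? This is a URL that is meant to be crawled; there is nothing illegal about it.

Thanks in advance.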

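You can request the final link directly and extract the data; there is no need to go through the 301 redirect to reach the correct URL.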

import requests
headers = {"Upgrade-Insecure-Requests": "1"}
response = requests.get(
    url="https://projects.contextweb.com/sellersjson/sellers.json",
    headers=headers,
    verify=False,
)
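
If hard-coding the final URL is not an option, one way to see where a redirect actually leads is to record each hop. This is only a minimal sketch using the standard library; `RecordingRedirectHandler` and `redirect_chain` are hypothetical names introduced here for illustration, and certificate verification is left at its default.

```python
import urllib.error
import urllib.request


class RecordingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Redirect handler that remembers every hop it is asked to follow."""

    def __init__(self):
        self.hops = []  # list of (status_code, target_url)

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Record the hop, then let the stock handler build the next request.
        self.hops.append((code, newurl))
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def redirect_chain(url, timeout=10):
    """Return (hops, final_url, final_status) for a GET on url."""
    handler = RecordingRedirectHandler()
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as response:
            return handler.hops, response.geturl(), response.status
    except urllib.error.HTTPError as err:
        # A 404 at the end of the chain still tells us where we ended up.
        return handler.hops, err.url, err.code
```

Calling `redirect_chain("http://pulsepoint.com/sellers.json")` would list each redirect and the final status, which helps tell whether the 404 comes from the first host or from the redirect target.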

You can also work around the SSL certificate error as follows:

from urllib.request import urlopen
import ssl
import json

#this is a workaround on the SSL error
ssl._create_default_https_context = ssl._create_unverified_context
crawled_url="pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
print(seller_json_url)

response = urlopen(seller_json_url).read() 
# print in dictionary format
print(json.loads(response))

Sample response:

{'contact_email': 'PublisherSupport@pulsepoint.com', 'contact_address': '360 Madison Avenue, 14th Floor, New York, NY, 10017', 'version': '1.0', 'identifiers': [{'name': 'TAG-ID', 'value': '89ff185a4c4e857c'}], 'sellers': [{'seller_id': '508738',

…'seller_type': 'PUBLISHER'}, {'seller_id': '562225', 'name': 'EL DIARIO', 'domain': 'impremedia.com', 'seller_type': 'PUBLISHER'}]}

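OK, for the benefit of others, here is a hardened version of Xmo's answer, because:

Some sites want headers before they will answer.
Some sites use odd encodings.
Some sites send gzip-compressed responses even when not asked to.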
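
The code for this answer is not reproduced here. As a rough sketch of how the three points could be handled, and not the poster's actual code, something like the following might work. The names `BROWSER_HEADERS`, `decode_body` and `fetch_sellers_json` are hypothetical, the header values are assumptions, and certificate verification is left switched on.

```python
import gzip
import json
import urllib.request

# Assumed browser-like headers; adjust to whatever the target sites accept.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) "
                  "Gecko/20100101 Firefox/74.0",
    "Accept": "application/json,text/html;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip",
}

GZIP_MAGIC = b"\x1f\x8b"


def decode_body(raw, charset=None):
    """Turn raw response bytes into text, whatever the server sent."""
    # Some servers gzip the body even when not asked; detect it by magic bytes.
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    # Try the declared charset first, then UTF-8 (with or without BOM).
    for encoding in (charset, "utf-8-sig"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding).lstrip("\ufeff")
        except (UnicodeDecodeError, LookupError):
            continue
    # latin-1 maps every byte to a character, so this never raises.
    return raw.decode("latin-1")


def fetch_sellers_json(host, timeout=10):
    """Fetch and parse http://<host>/sellers.json; urllib follows redirects."""
    url = "http://{}/sellers.json".format(host)
    request = urllib.request.Request(url, headers=BROWSER_HEADERS)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        charset = response.headers.get_content_charset()
    return json.loads(decode_body(raw, charset))
```

When crawling several hundred hosts, `fetch_sellers_json` would be called inside a `try` block, since hosts without a sellers.json raise `urllib.error.HTTPError` and malformed files raise `json.JSONDecodeError`.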

Thanks again.

Thanks for the suggestion, but I have several hundred URLs: some have a sellers.json registered, some do not, and some use redirects. So I would like a robust crawler that fits most cases. As for verify=False, thanks.

Thanks for your answer! Your code works, but it needs the redirected URL rather than the original one. I have several redirected URLs that work fine, and I don't understand why this one would need the other domain hard-coded.

Sorry for using the redirected URL. I updated my answer.

Thank you! I had to do some work on the solution, because how well my headers are honored varies from URL to URL, but your answer is the right one. I will add one of my own as a complement.