Scrapy returns a different status code than Python requests for a GET request with the same headers and method


python python-3.x web-scraping scrapy


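For a while I have been using the cloudscraper package to scrape data from websites protected by CloudFlare. This approach recently stopped working. While investigating the problem, I found that Scrapy appears to fail (it gets a 503 and a captcha page), whereas the same request "succeeds" when made with requests.get using the same headers.

To reproduce: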

>>> fetch('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94')
2020-11-23 16:13:37 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94> (referer: None)

So I use cloudscraper to obtain the appropriate headers/cookies:

>>> import cloudscraper
>>> cs = cloudscraper.create_scraper()
>>> cs_response = cs.get('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94')
2020-11-23 16:19:16 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): targetlaos.com:443
2020-11-23 16:19:17 [urllib3.connectionpool] DEBUG: https://targetlaos.com:443 "GET /category/news/%E0%BA%82%E0%BB%88%E0%BA%B2%E0%BA%A7-%E0%BA%9E%E0%BA%B2%E0%BA%8D%E0%BB%83%E0%BA%99%E0%BA%9B%E0%BA%B0%E0%BB%80%E0%BA%97%E0%BA%94 HTTP/1.1" 200 None
# Note: I do the below for the second time, so I can capture the request-headers for testing. I'm aware I can do this more efficiently
>>> cs_response = cs.get('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94')
2020-11-23 16:25:19 [urllib3.connectionpool] DEBUG: https://targetlaos.com:443 "GET /category/news/%E0%BA%82%E0%BB%88%E0%BA%B2%E0%BA%A7-%E0%BA%9E%E0%BA%B2%E0%BA%8D%E0%BB%83%E0%BA%99%E0%BA%9B%E0%BA%B0%E0%BB%80%E0%BA%97%E0%BA%94 HTTP/1.1" 200 None
>>> cs_response.request.headers
{'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:65.0) Gecko/20100101 Firefox/65.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en-US,en;q=0.5', 'Accept-Encoding': 'gzip, deflate', 'Cookie': '__cf_bm=03ac0515966d1267f42bd7d107562a0b9d5d3362-1606148357-1800-AYHPir56JSKk63KtQUR1vGzVipMSH3fHiDSDO/ireLEQ; __cfduid=dead0a14752f609eba026cd5f982dccec1606148356; wppas_pvbl=%5B%2252735%22%2C52736%5D; wppas_user_stats=%7B%221606089600%22%3A%7B%22impressions%22%3A%7B%22banners%22%3A%5B%2252735%22%2C52736%5D%7D%2C%22clicks%22%3A%7B%22banners%22%3A%5B%5D%7D%7D%7D'}

So if this request now works with these cookies, I should be able to use them in Scrapy to fetch the data:

>>> fetch('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94', headers=cs_response.request.headers)
2020-11-23 16:29:49 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94> (referer: None)
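One variation that may be worth ruling out is handing the cookies to Scrapy through the Request `cookies` argument instead of a raw `Cookie` header. The sketch below only shows how the header string captured from cloudscraper could be split into a dict. The helper names `cookie_header_to_dict` and `split_headers_and_cookies` are hypothetical, and note that Scrapy only applies `cookies=` when its CookiesMiddleware is enabled. Nothing here is confirmed to change the 503.

```python
def cookie_header_to_dict(cookie_header):
    """Split a raw Cookie header ('a=1; b=2') into a {name: value} dict."""
    cookies = {}
    for pair in cookie_header.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first '=' only, since cookie values may contain '='.
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies


def split_headers_and_cookies(headers):
    """Return (headers without Cookie, cookies as a dict)."""
    plain = {k: v for k, v in headers.items() if k.lower() != "cookie"}
    cookie_header = next(
        (v for k, v in headers.items() if k.lower() == "cookie"), ""
    )
    return plain, cookie_header_to_dict(cookie_header)


# Small stand-in for cs_response.request.headers (values shortened).
example_headers = {
    "User-Agent": "Mozilla/5.0",
    "Cookie": "__cfduid=abc123; wppas_pvbl=%5B%2252735%22%5D",
}
plain_headers, cookies = split_headers_and_cookies(example_headers)
print(plain_headers)
print(cookies)

# In the Scrapy shell this could then be tried as (assumption, untested here):
#   fetch(url, headers=plain_headers, cookies=cookies)
```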

If I try the same request with Python requests:

>>> requests.get('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94', headers=cs_response.request.headers)
2020-11-23 16:33:13 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): targetlaos.com:443
2020-11-23 16:33:14 [urllib3.connectionpool] DEBUG: https://targetlaos.com:443 "GET /category/news/%E0%BA%82%E0%BB%88%E0%BA%B2%E0%BA%A7-%E0%BA%9E%E0%BA%B2%E0%BA%8D%E0%BB%83%E0%BA%99%E0%BA%9B%E0%BA%B0%E0%BB%80%E0%BA%97%E0%BA%94 HTTP/1.1" 200 None
<Response [200]>


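So the "same" request (or rather, what I expect to be exactly the same request) works in Python requests but not in Scrapy. Does anyone know why this might be?

Note: I tested the above on an AWS EC2 instance running Ubuntu Server 20.04.1 and Python 3.7.9, because from my local IP I do not get the CloudFlare page.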

What I have tried:

Disabling all downloader middlewares
The latest version of Scrapy (2.4.1)
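The question does not show how the downloader middlewares were disabled. As a rough sketch, Scrapy lets a built-in middleware be switched off by mapping its import path to `None` in the `DOWNLOADER_MIDDLEWARES` setting. The selection below is an assumption: it covers only the built-in middlewares that add or rewrite request headers, not every default middleware.

```python
# settings.py (sketch): switch off built-in middlewares that touch headers.
# Mapping a middleware path to None disables it.
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware": None,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "scrapy.downloadermiddlewares.cookies.CookiesMiddleware": None,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": None,
}
```

With these off, the headers passed to the request should reach the downloader without Scrapy adding its own defaults, which makes a header-level comparison with requests more meaningful.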

You can point both requests at a service that echoes them back and compare the results, to see how similar they really are. Some websites may vary their response based on things other than the request content, though. See this recent issue, which basically asks your question.

Hi @Gallaecio, thanks for the suggestion! I have tried it, and when both requests are sent to Httpbin their headers are the same (there is a different "X-Amzn-Trace-Id", but I believe that is added on the Httpbin side). I also tried the suggestion from the linked issue, using DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.2', but that did not change anything either. There may be some difference at a lower protocol level that CloudFlare picks up on. If you have any further suggestions, please let me know!

You could try an alternative, just to try something different.
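For reference, the TLS setting mentioned in the comments would look roughly like this in a project's settings. `DOWNLOADER_CLIENT_TLS_METHOD` is the setting quoted in the thread. `DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING` is an addition not mentioned there, included on the assumption that seeing the negotiated TLS parameters in the log helps when comparing against requests.

```python
# settings.py (sketch)

# Force the TLS version used by Scrapy's HTTPS downloader,
# as tried in the comment thread (it did not fix the 503 there).
DOWNLOADER_CLIENT_TLS_METHOD = "TLSv1.2"

# Assumption: not part of the original thread. Logs TLS connection
# details (protocol version, cipher) once a connection is established.
DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING = True
```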

>>> response.request.headers
{b'User-Agent': [b'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:65.0) Gecko/20100101 Firefox/65.0'], b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en-US,en;q=0.5'], b'Accept-Encoding': [b'gzip, deflate'], b'Cookie': [b'__cf_bm=03ac0515966d1267f42bd7d107562a0b9d5d3362-1606148357-1800-AYHPir56JSKk63KtQUR1vGzVipMSH3fHiDSDO/ireLEQ; __cfduid=dead0a14752f609eba026cd5f982dccec1606148356; wppas_pvbl=%5B%2252735%22%2C52736%5D; wppas_user_stats=%7B%221606089600%22%3A%7B%22impressions%22%3A%7B%22banners%22%3A%5B%2252735%22%2C52736%5D%7D%2C%22clicks%22%3A%7B%22banners%22%3A%5B%5D%7D%7D%7D']}
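The Scrapy header dump above uses byte-string names with list values, while requests uses plain strings, so comparing the two by eye is error-prone. A minimal sketch of a programmatic comparison follows; the function names are hypothetical and the sample dicts are shortened stand-ins for the real dumps. Identical headers only show that the HTTP layer matches. They say nothing about differences lower down, such as the TLS handshake.

```python
def normalize_scrapy_headers(headers):
    """Convert Scrapy-style headers ({bytes: [bytes, ...]}) to {str: str}."""
    normalized = {}
    for name, values in headers.items():
        # Each header holds a list of byte strings; join repeated values.
        joined = ", ".join(value.decode("latin-1") for value in values)
        normalized[name.decode("latin-1").lower()] = joined
    return normalized


def normalize_requests_headers(headers):
    """Lower-case the header names of a requests-style {str: str} mapping."""
    return {str(name).lower(): str(value) for name, value in headers.items()}


def diff_headers(a, b):
    """Return {name: (value_in_a, value_in_b)} for every header that differs."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


# Shortened stand-ins for response.request.headers (Scrapy)
# and cs_response.request.headers (requests/cloudscraper).
scrapy_style = {
    b"User-Agent": [b"Mozilla/5.0"],
    b"Accept-Encoding": [b"gzip, deflate"],
}
requests_style = {
    "User-Agent": "Mozilla/5.0",
    "Accept-Encoding": "gzip, deflate",
}

differences = diff_headers(
    normalize_scrapy_headers(scrapy_style),
    normalize_requests_headers(requests_style),
)
print(differences)  # an empty dict means the header sets match
```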
