Scrapy returns a different status code than Python requests for a GET request with the same headers and method

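For a while I have been using the cloudscraper package to scrape data from websites protected by CloudFlare. This approach recently stopped working. While investigating, I found that Scrapy fails (it gets a 503 and a captcha page), whereas the same request, with the same headers, succeeds when made with "requests.get".

To reproduce: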

>>> fetch('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94')
2020-11-23 16:13:37 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94> (referer: None)
So I use cloudscraper to obtain the appropriate headers/cookies:

>>> import cloudscraper
>>> cs = cloudscraper.create_scraper()
>>> cs_response = cs.get('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94')
2020-11-23 16:19:16 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): targetlaos.com:443
2020-11-23 16:19:17 [urllib3.connectionpool] DEBUG: https://targetlaos.com:443 "GET /category/news/%E0%BA%82%E0%BB%88%E0%BA%B2%E0%BA%A7-%E0%BA%9E%E0%BA%B2%E0%BA%8D%E0%BB%83%E0%BA%99%E0%BA%9B%E0%BA%B0%E0%BB%80%E0%BA%97%E0%BA%94 HTTP/1.1" 200 None
# Note: I do the below for the second time, so I can capture the request-headers for testing. I'm aware I can do this more efficiently
>>> cs_response = cs.get('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94')
2020-11-23 16:25:19 [urllib3.connectionpool] DEBUG: https://targetlaos.com:443 "GET /category/news/%E0%BA%82%E0%BB%88%E0%BA%B2%E0%BA%A7-%E0%BA%9E%E0%BA%B2%E0%BA%8D%E0%BB%83%E0%BA%99%E0%BA%9B%E0%BA%B0%E0%BB%80%E0%BA%97%E0%BA%94 HTTP/1.1" 200 None
>>> cs_response.request.headers
{'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:65.0) Gecko/20100101 Firefox/65.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en-US,en;q=0.5', 'Accept-Encoding': 'gzip, deflate', 'Cookie': '__cf_bm=03ac0515966d1267f42bd7d107562a0b9d5d3362-1606148357-1800-AYHPir56JSKk63KtQUR1vGzVipMSH3fHiDSDO/ireLEQ; __cfduid=dead0a14752f609eba026cd5f982dccec1606148356; wppas_pvbl=%5B%2252735%22%2C52736%5D; wppas_user_stats=%7B%221606089600%22%3A%7B%22impressions%22%3A%7B%22banners%22%3A%5B%2252735%22%2C52736%5D%7D%2C%22clicks%22%3A%7B%22banners%22%3A%5B%5D%7D%7D%7D'}
So, if this request now works with these cookies, I should be able to use them in Scrapy to fetch the data:

>>> fetch('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94', headers=cs_response.request.headers)
2020-11-23 16:29:49 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94> (referer: None)
If I try the same request with Python requests:

>>> requests.get('https://targetlaos.com/category/news/%e0%ba%82%e0%bb%88%e0%ba%b2%e0%ba%a7-%e0%ba%9e%e0%ba%b2%e0%ba%8d%e0%bb%83%e0%ba%99%e0%ba%9b%e0%ba%b0%e0%bb%80%e0%ba%97%e0%ba%94', headers=cs_response.request.headers)
2020-11-23 16:33:13 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): targetlaos.com:443
2020-11-23 16:33:14 [urllib3.connectionpool] DEBUG: https://targetlaos.com:443 "GET /category/news/%E0%BA%82%E0%BB%88%E0%BA%B2%E0%BA%A7-%E0%BA%9E%E0%BA%B2%E0%BA%8D%E0%BB%83%E0%BA%99%E0%BA%9B%E0%BA%B0%E0%BB%80%E0%BA%97%E0%BA%94 HTTP/1.1" 200 None
<Response [200]>
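So the "same" request (or rather, what I expect to be exactly the same request) works with Python requests but not with Scrapy. Does anyone know why this happens?

Note: I tested the above on an AWS EC2 instance running Ubuntu Server 20.04.1 and Python 3.7.9, because from my local IP I do not get the CloudFlare page.

What I have tried:

  • Disabling all downloader middlewares
  • The latest version of Scrapy (2.4.1)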
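
One more thing that could be ruled out alongside the items above is how the cookies reach Scrapy. The sketch below is an assumption-laden illustration, not something from the original session: it splits the raw Cookie header captured from cloudscraper into a dict, which could then be passed through the `cookies` argument of a Scrapy request instead of being embedded in `headers`. The helper name `cookie_header_to_dict` and the sample cookie string are hypothetical.

```python
def cookie_header_to_dict(cookie_header):
    """Split a raw 'Cookie' request header into a {name: value} dict.

    Hypothetical helper for illustration. Only the first '=' in each pair
    separates name from value, so values containing '=' are preserved.
    """
    cookies = {}
    for pair in cookie_header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


# Made-up sample with the same shape as the header captured above.
sample = "__cfduid=abc123; wppas_pvbl=%5B%2252735%22%5D; token=a=b"
parsed = cookie_header_to_dict(sample)
print(parsed)

# Assumed usage in a spider (not run here):
#   yield scrapy.Request(url, headers=other_headers, cookies=parsed)
```

Whether this changes the outcome is not established by anything in the question; it only removes one variable from the comparison.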

You can point both requests at a request-echo service such as Httpbin and compare them to see how similar they really are. Some websites may refuse to serve a response based on things other than the request content, though. Take a look at a recent issue that raises essentially the same question.

Hi @Gallaecio, thanks for the suggestion! I tried that: when both requests are sent to Httpbin, their headers are identical (there is a different "X-Amzn-Trace-Id", but I believe that is added on the Httpbin side). I also tried the suggestion from the linked issue, setting DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.2', but that did not change anything either. There may be some difference in the lower-level protocols that CloudFlare takes into account. If you have any other suggestions, please let me know!

You could try a different tool, just to try something else.
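
For readers who want to repeat the TLS experiment mentioned in the reply, the fragment below shows where that setting would typically live. It is a minimal sketch assuming a standard Scrapy project with a `settings.py` module; as reported above, it did not change the result for this site.

```python
# settings.py (assumed location: the Scrapy project's settings module)

# Force the TLS version Scrapy's default downloader negotiates.
# 'TLSv1.2' is the value tried in the comment above; Scrapy's default is 'TLS'.
DOWNLOADER_CLIENT_TLS_METHOD = "TLSv1.2"
```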
For reference, the request headers that Scrapy sent:

>>> response.request.headers
{b'User-Agent': [b'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:65.0) Gecko/20100101 Firefox/65.0'], b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en-US,en;q=0.5'], b'Accept-Encoding': [b'gzip, deflate'], b'Cookie': [b'__cf_bm=03ac0515966d1267f42bd7d107562a0b9d5d3362-1606148357-1800-AYHPir56JSKk63KtQUR1vGzVipMSH3fHiDSDO/ireLEQ; __cfduid=dead0a14752f609eba026cd5f982dccec1606148356; wppas_pvbl=%5B%2252735%22%2C52736%5D; wppas_user_stats=%7B%221606089600%22%3A%7B%22impressions%22%3A%7B%22banners%22%3A%5B%2252735%22%2C52736%5D%7D%2C%22clicks%22%3A%7B%22banners%22%3A%5B%5D%7D%7D%7D']}
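
The two header dumps in this thread have different shapes: requests shows plain strings, while Scrapy shows bytes keys with list values. A small normaliser makes the comparison suggested in the comments less error-prone. This is a sketch under assumptions: the function names `normalize_headers` and `diff_headers` are hypothetical, and the inputs are plain dicts shaped like the two dumps. Identical headers would still not rule out lower-level differences such as the TLS handshake, which the comments point to.

```python
def _to_text(value):
    """Decode bytes to str; leave other values as their string form."""
    return value.decode("latin-1") if isinstance(value, bytes) else str(value)


def normalize_headers(headers):
    """Return {lowercase name: value} from requests- or Scrapy-style headers.

    Scrapy-style list values are joined with ', ' so both shapes compare equal.
    """
    normalized = {}
    for name, value in headers.items():
        if isinstance(value, (list, tuple)):
            text = ", ".join(_to_text(v) for v in value)
        else:
            text = _to_text(value)
        normalized[_to_text(name).lower()] = text
    return normalized


def diff_headers(a, b):
    """Map each differing header name to its (value_in_a, value_in_b) pair."""
    na, nb = normalize_headers(a), normalize_headers(b)
    return {
        name: (na.get(name), nb.get(name))
        for name in sorted(set(na) | set(nb))
        if na.get(name) != nb.get(name)
    }


# Shapes mirror the two dumps in this thread; the values are shortened.
requests_style = {"User-Agent": "Mozilla/5.0", "Accept-Encoding": "gzip, deflate"}
scrapy_style = {
    b"User-Agent": [b"Mozilla/5.0"],
    b"Accept-Encoding": [b"gzip, deflate"],
    b"X-Extra": [b"1"],
}
print(diff_headers(requests_style, scrapy_style))
```

With the sample values above, only the extra `x-extra` header is reported, since the shared headers compare equal once normalised.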