
Scrapy-Splash "masks" 404


I am having some problems trying to handle 404 responses with my spider. It looks like Scrapy-Splash masks 404 responses with a 200.

Here is my code:

from scrapy import Request

def buildRequest(self, url, dbid):
    request = Request(url, self.parse, meta={
        'splash': {
            'args': {
                'html': 1,
                'wait': 5
            },
            'magic_response': True,
        },
        'dbId': dbid
    }, errback=self.errback_httpbin, dont_filter=True)
    return request
A simple print response.status always shows 200. Testing my URL with scrapy shell, however, shows the real 404 response.

When I use a plain Request object, my spider goes to the self.errback_httpbin method, but with a SplashRequest it does not. SplashRequest handles 502 correctly, but not 404.


Thanks

It seems you can only achieve this by using the /execute endpoint combined with "magic responses" (which are enabled by default):

meta['splash']['magic_response'] - when set to True and a JSON response is received from Splash, several attributes of the response (headers, body, url, status code) are filled using data returned in JSON:

  • response.headers are filled from the "headers" key;
  • response.url is set to the value of the "url" key;
  • response.body is set to the value of the "html" key, or to the base64-decoded value of the "body" key;
  • response.status is set from the value of the "http_status" key. (...)
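
In other words, once the script returns those keys, the response your callback receives carries the remote server's real status code. A minimal sketch of checking it, assuming a spider callback and magic responses active:

def parse(self, response):
    # With magic responses, response.status is populated from the
    # 'http_status' key returned by the Lua script, not from Splash's
    # own HTTP 200, so the remote server's real status is visible here
    # (non-200s are routed to the errback by default, as shown below).
    self.logger.info('%s answered with HTTP %d', response.url, response.status)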
If you use SplashRequest, this option is set to True by default.
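
For comparison with the question's buildRequest, here is a sketch of the same request written as a SplashRequest; it assumes the Lua script defined in the spider further down, and magic_response is left at its default of True:

from scrapy_splash import SplashRequest

def buildRequest(self, url, dbid):
    # Sketch only: the magic-response machinery works here because
    # /execute returns JSON with the url/headers/http_status/html keys
    # listed above.
    return SplashRequest(
        url,
        self.parse,
        endpoint='execute',
        args={'lua_source': script},  # 'script' as defined in the spider below
        meta={'dbId': dbid},
        errback=self.errback_httpbin,
        dont_filter=True,
    )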

Other endpoints, such as /render.html or /render.json, will return a 502 Bad Gateway for 4xx and 5xx responses from the remote server (to be checked).
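
For completeness, a hedged sketch of what that would look like from an errback if you stayed on /render.html; the errback_render_html name is illustrative, and the pattern is the same HttpError check used in the spider below:

from scrapy.spidermiddlewares.httperror import HttpError

def errback_render_html(self, failure):
    # Assumption (per the "to be checked" note above): with /render.html,
    # a remote 4xx/5xx surfaces as a 502 Bad Gateway from Splash, so the
    # target page's real status code is not recoverable here.
    if failure.check(HttpError):
        response = failure.value.response
        if response.status == 502:
            self.logger.error('Splash returned 502 for %s', response.url)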

Building on this (note the table at the end of the script, which returns url, headers, http_status, html and cookies), you can reproduce the behavior by using the following script with /execute, a SplashRequest, and an errback:


import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

from scrapy_splash import SplashRequest

script = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
    })
  assert(splash:wait(0.5))

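  -- splash:history() returns an entry for each page load (redirects
  -- included); the last entry holds the real HTTP status of the final
  -- page, which is returned as 'http_status' below so that magic
  -- responses can populate response.status on the Scrapy side.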
  local entries = splash:history()
  local last_response = entries[#entries].response
  return {
    url = splash:url(),
    headers = last_response.headers,
    http_status = last_response.status,
    cookies = splash:get_cookies(),
    html = splash:html(),
  }
end
"""

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield SplashRequest(u, callback=self.parse_httpbin,
                                   errback=self.errback_httpbin,
                                   endpoint='execute',
                                   args={'lua_source': script})

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

Running this with Scrapy 1.3, you get the following output:

$ scrapy crawl errback_example
2017-01-11 18:07:20 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: test404)
(...)
2017-01-11 18:07:20 [scrapy.core.engine] INFO: Spider opened
(...)
2017-01-11 18:07:21 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org/status/500 via http://localhost:8050/execute> (failed 1 times): 500 Internal Server Error
2017-01-11 18:07:21 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.httpbin.org/status/404 via http://localhost:8050/execute> (referer: None)
2017-01-11 18:07:21 [errback_example] ERROR: <twisted.python.failure.Failure scrapy.spidermiddlewares.httperror.HttpError: Ignoring non-200 response>
2017-01-11 18:07:21 [errback_example] ERROR: HttpError on http://www.httpbin.org/status/404
2017-01-11 18:07:21 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org/status/500 via http://localhost:8050/execute> (failed 2 times): 500 Internal Server Error
2017-01-11 18:07:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.httpbin.org/ via http://localhost:8050/execute> (referer: None)
2017-01-11 18:07:21 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.httpbin.org/status/500 via http://localhost:8050/execute> (failed 3 times): 500 Internal Server Error
2017-01-11 18:07:21 [scrapy.core.engine] DEBUG: Crawled (500) <GET http://www.httpbin.org/status/500 via http://localhost:8050/execute> (referer: None)
2017-01-11 18:07:21 [errback_example] INFO: Got successful response from http://www.httpbin.org/
2017-01-11 18:07:21 [errback_example] ERROR: <twisted.python.failure.Failure scrapy.spidermiddlewares.httperror.HttpError: Ignoring non-200 response>
2017-01-11 18:07:21 [errback_example] ERROR: HttpError on http://www.httpbin.org/status/500
2017-01-11 18:07:21 [scrapy.core.engine] INFO: Closing spider (finished)
2017-01-11 18:07:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 5365,
 'downloader/request_count': 5,
 'downloader/request_method_count/POST': 5,
 'downloader/response_bytes': 17332,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/400': 4,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 1, 11, 17, 7, 21, 715440),
 'log_count/DEBUG': 7,
 'log_count/ERROR': 4,
 'log_count/INFO': 8,
 'response_received_count': 3,
 'scheduler/dequeued': 8,
 'scheduler/dequeued/memory': 8,
 'scheduler/enqueued': 8,
 'scheduler/enqueued/memory': 8,
 'splash/execute/request_count': 3,
 'splash/execute/response_count/200': 1,
 'splash/execute/response_count/400': 4,
 'start_time': datetime.datetime(2017, 1, 11, 17, 7, 20, 683232)}
2017-01-11 18:07:21 [scrapy.core.engine] INFO: Spider closed (finished)

The [errback_example] ERROR lines show when the errback is called; here you can see the 404 and the 500 being passed through the errback method.
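
As an aside, if you would rather receive the 404 in your normal callback instead of the errback, Scrapy's standard handle_httpstatus_list spider attribute combines with the same /execute script; a minimal sketch (spider name and class are illustrative):

import scrapy

class Handle404Spider(scrapy.Spider):
    name = "handle_404_example"
    # Let the HttpError spider middleware pass 404 responses through to
    # the callback instead of routing them to the errback.
    handle_httpstatus_list = [404]

    # start_requests as in the spider above...

    def parse_httpbin(self, response):
        if response.status == 404:
            self.logger.info('Got a 404 from %s', response.url)
            return
        self.logger.info('Got successful response from %s', response.url)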