Python: getting the proxy response in a middleware (Scrapy)


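I have the following problem with Scrapy in my middleware:

I make requests to a site over https and use a proxy. When I define a middleware and use process_response in it, response.headers contains only the headers sent by the website. Is there a way to get the headers from the response to the CONNECT request that establishes the proxy tunnel? The proxy we use adds some information as headers in that response, and we want to use it in the middleware. I found that in TunnelingTCP4ClientEndpoint.processProxyResponse the parameter rcvd_bytes contains all the information I need. I have not found a way to get at rcvd_bytes from my middleware.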
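I also found a similar question from a year ago, which was never solved:
Here is an example from the proxy's website:
For HTTPS, the IP of the proxy peer is in the CONNECT response header x-hola-ip. Example for a proxy peer whose IP is 5.6.7.8:
Request: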
CONNECT example.com:80 HTTP/1.1
Host: example.com:80
Accept: */*

Response:
HTTP/1.1 200 OK
Content-Type: text/html
x-hola-ip: 5.6.7.8

In this example, I want to get x-hola-ip.
When using curl, as in curl --proxy mysuperproxy https://stackoverflow.com
the CONNECT response also contains the correct data.
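For illustration, here is a minimal sketch of pulling a header such as x-hola-ip out of the raw bytes of a CONNECT reply like the one above. The helper name parse_connect_response is hypothetical (it is not part of Scrapy or Twisted), and the sketch assumes the reply uses CRLF line endings and simple "Name: value" header lines.

```python
def parse_connect_response(rcvd_bytes):
    """Split a raw proxy CONNECT reply into (status_line, headers).

    Hypothetical helper for illustration only; not a Scrapy/Twisted API.
    """
    # The header section ends at the first blank line; ignore anything after it.
    head, _, _ = rcvd_bytes.partition(b"\r\n\r\n")
    lines = head.decode("latin-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:  # skip malformed lines that have no colon
            headers[name.strip().lower()] = value.strip()
    return status_line, headers


# Sample reply modelled on the example response shown above.
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"x-hola-ip: 5.6.7.8\r\n"
    b"\r\n"
)
status, headers = parse_connect_response(raw)
print(status)
print(headers.get("x-hola-ip"))
```

Header names are lowercased here so that lookups do not depend on how the proxy capitalizes them.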
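If this is not possible, my fallback so far would be to monkey patch the classes somehow, unless you know a better, more Pythonic solution.
Thanks in advance for your help.
Note: I have also posted this question on Scrapy's GitHub. If I find a solution, I will update both sites :)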
Working solution, with Matthew's help:
from scrapy.core.downloader.handlers.http11 import (
    HTTP11DownloadHandler, ScrapyAgent, TunnelingTCP4ClientEndpoint, TunnelError, TunnelingAgent
)
from scrapy import twisted_version

class MyHTTPDownloader(HTTP11DownloadHandler):
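    # raw bytes of the most recent proxy CONNECT response;
    # class-level, so the value is shared by all requests
    i = ''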
    def download_request(self, request, spider):
        # we're just overriding here to monkey patch the attribute
        agent = ScrapyAgent(contextFactory=self._contextFactory, pool=self._pool,
            maxsize=getattr(spider, 'download_maxsize', self._default_maxsize),
            warnsize=getattr(spider, 'download_warnsize', self._default_warnsize),
            fail_on_dataloss=self._fail_on_dataloss)


        agent._TunnelingAgent = MyTunnelingAgent

        return agent.download_request(request)

class MyTunnelingAgent(TunnelingAgent):
    if twisted_version >= (15, 0, 0):
        def _getEndpoint(self, uri):
            return MyTunnelingTCP4ClientEndpoint(
                self._reactor, uri.host, uri.port, self._proxyConf,
                self._contextFactory, self._endpointFactory._connectTimeout,
                self._endpointFactory._bindAddress)
    else:
        def _getEndpoint(self, scheme, host, port):
            return MyTunnelingTCP4ClientEndpoint(
                self._reactor, host, port, self._proxyConf,
                self._contextFactory, self._connectTimeout,
                self._bindAddress)

class MyTunnelingTCP4ClientEndpoint(TunnelingTCP4ClientEndpoint):
    def processProxyResponse(self, rcvd_bytes):
        # keep the raw CONNECT response so a middleware can read it later
        MyHTTPDownloader.i = rcvd_bytes
        return super(MyTunnelingTCP4ClientEndpoint, self).processProxyResponse(rcvd_bytes)

In your settings:
DOWNLOAD_HANDLERS = {
    'http': 'crawler.MyHTTPDownloader.MyHTTPDownloader',
    'https': 'crawler.MyHTTPDownloader.MyHTTPDownloader',
}
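The handler above leaves the raw CONNECT reply in the class attribute MyHTTPDownloader.i. As a rough sketch of how a downloader middleware might read it, consider the following. The middleware name ProxyConnectHeadersMiddleware and the meta key proxy_connect_headers are hypothetical, and the MyHTTPDownloader class here is only a stand-in so that the snippet runs without Scrapy installed; in a real project you would import your own handler instead.

```python
from types import SimpleNamespace


class MyHTTPDownloader:
    """Stand-in for the handler subclass; only the shared attribute matters here."""
    i = b""


class ProxyConnectHeadersMiddleware:
    """Hypothetical downloader middleware exposing the CONNECT reply headers."""

    def process_response(self, request, response, spider):
        # The attribute may still hold its initial empty value.
        raw = MyHTTPDownloader.i or b""
        head = raw.partition(b"\r\n\r\n")[0]
        headers = {}
        # Skip the status line, then split each "Name: value" pair.
        for line in head.split(b"\r\n")[1:]:
            name, sep, value = line.partition(b":")
            if sep:
                key = name.strip().lower().decode("latin-1")
                headers[key] = value.strip().decode("latin-1")
        # Hypothetical meta key, chosen for this sketch only.
        request.meta["proxy_connect_headers"] = headers
        return response


# Minimal demonstration with fake request/response objects.
MyHTTPDownloader.i = b"HTTP/1.1 200 OK\r\nx-hola-ip: 5.6.7.8\r\n\r\n"
request = SimpleNamespace(meta={})
response = SimpleNamespace(status=200)
result = ProxyConnectHeadersMiddleware().process_response(request, response, spider=None)
print(request.meta["proxy_connect_headers"])
```

Two caveats, as far as I understand the design: because the attribute is shared by the whole class, with concurrent requests the stored bytes may belong to a different request, and a CONNECT reply is only received when a new tunnel is opened, so requests that reuse a pooled connection will see the value left by an earlier one.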

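I see that the Scrapinghub folks said they are unlikely to add this feature, and recommended creating a custom subclass to get the behavior you want. So, with that in mind:
I believe that after creating the subclass, you can tell Scrapy to use it by setting the http and https keys in DOWNLOAD_HANDLERS to point to your subclass.
Bear in mind that I don't have a local http proxy that sends extra headers to test with, so this is just a "napkin sketch" of what I think needs to happen: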
from scrapy.core.downloader.handlers.http11 import (
    HTTP11DownloadHandler, ScrapyAgent, TunnelingAgent,
)

class MyHTTPDownloader(HTTP11DownloadHandler):
    def download_request(self, request, spider):
        # we're just overriding here to monkey patch the attribute
        ScrapyAgent._TunnelingAgent = MyTunnelingAgent
        return super(MyHTTPDownloader, self).download_request(request, spider)

class MyTunnelingAgent(TunnelingAgent):
    # ... and here is where it would get weird
    pass

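That last bit is where I am hand-waving: I believe I have a clear understanding of the methods that need to be overridden to capture the bytes you want, but I don't have enough of the Twisted framework in my head to know where to put them so that they are exposed on the Response that goes back to the spider.

Hi Bernard, welcome! I want to congratulate you: this may be the best question I have ever seen from a new contributor, so thank you. As a suggestion, in the future I would add the link to the issue you created to your question. Good luck!

Thanks for the hint, here it is: This helped me create a working solution that I can use in my case. I will add the working code to the question. In my case it was possible to add a static variable to MyHTTPDownloader, which I can also access in the middleware.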