Python: How do I get the default headers in a urllib2 request?

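I have a Python web client that uses urllib2. Adding HTTP headers to my outgoing requests is easy enough: I just create a dictionary of the headers I want to add and pass it to the Request initializer.

However, other "standard" HTTP headers get added to the request alongside the custom ones I add explicitly. When I sniff the request with Wireshark, I see headers besides the ones I added myself. My question is: how do I get access to these headers? I want to log every request (including the full set of HTTP headers) and can't figure out how.

Any pointers?

In a nutshell: how do I get all the outgoing headers from an HTTP request created by urllib2?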

See urllib2.py:do_request_ (line 1044 (1067)) and urllib2.py:do_open (line 1073).
(line 293) self.addheaders = [('User-agent', client_version)] (only 'User-agent' is added)
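
The default header list mentioned above can be inspected directly on an opener. This is a minimal sketch that assumes Python 3, where urllib2 was renamed to urllib.request; on Python 2 the attribute lives on urllib2.OpenerDirector instead.

```python
import urllib.request

# build_opener() returns an OpenerDirector with the default handlers installed.
opener = urllib.request.build_opener()

# addheaders is a list of (name, value) tuples; each one is applied to every
# request that does not already carry a header of the same name.
for name, value in opener.addheaders:
    print("%s: %s" % (name, value))
```

The other headers seen on the wire (Host, Connection, Accept-Encoding) are not in this list; they are added later, while the request is being prepared and sent.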

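It should send the default HTTP headers (as specified) alongside the ones you set yourself. If you want to see them in full, you can use a packet-sniffing tool.

Edit:

If you want to log them, you can use WinPcap to capture the packets sent by a specific application (Python, in your case). You can also specify the packet type and many other details.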


-John

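The urllib2 library uses OpenerDirector objects to handle the actual opening. Fortunately, the Python library provides defaults so you don't have to. It is, however, these OpenerDirector objects that add the extra headers.

To see what they are after the request has been sent (so that you can log them, for example), look at the request's unredirected_hdrs attribute.

unredirected_hdrs is where the OpenerDirector dumps its extra headers. Simply looking at req.headers shows only your own headers; the library leaves those untouched for you.

If you need to see the headers before the request is sent, you will need to subclass OpenerDirector in order to intercept the transmission.

Hope that helps.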

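Edit: I forgot to mention that, once the request has been sent, req.header_items() gives you a list of tuples of all the headers, both your own and the ones added by the OpenerDirector. I should have mentioned this first since it is the most straightforward :-) Sorry.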
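
Here is a minimal sketch of the header_items() approach. It assumes Python 3 (urllib.request instead of urllib2); QuietHandler and the throwaway local server are hypothetical and exist only so the example runs without internet access.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class QuietHandler(BaseHTTPRequestHandler):
    """Hypothetical local test server: answers every GET with 'ok'."""

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output clean


server = HTTPServer(("127.0.0.1", 0), QuietHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

# An empty ProxyHandler makes sure the request goes straight to the local server.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
req = urllib.request.Request(url, headers={"X-Custom": "demo"})

before = dict(req.header_items())   # only the header we set ourselves
with opener.open(req) as resp:
    resp.read()
after = dict(req.header_items())    # now includes the headers urllib added

server.shutdown()
server.server_close()

print(sorted(before))
print(sorted(after))
```

In this sketch the request gains Host and User-agent once it has been sent. Connection and Accept-Encoding do not show up, because they are added further down, at the connection level, which is why the handler-based approaches in this thread exist.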

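Edit 2: After your question about an example of defining your own handler, here is the sample I came up with. The concern with any monkeying with the request chain is that the handler must be safe for multiple requests, which is why I am uncomfortable simply replacing the definition of putheader directly on the HTTPConnection class.

Sadly, because the internals of HTTPConnection and AbstractHTTPHandler are very internal, we have to reproduce much of the code from the Python library to inject our custom behaviour. Assuming I have not goofed below and this works as well as it did in my 5 minutes of testing, please be careful to revisit this override whenever you update your Python version (i.e. 2.5.x to 2.5.y, or 2.5 to 2.6, etc.).

I should therefore mention that I am on Python 2.5.1. If you have 2.6 or, in particular, 3.0, you may need to adjust this accordingly.

Please let me know if this doesn't work. I am having way too much fun with this question: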

import urllib2
import httplib
import socket


class CustomHTTPConnection(httplib.HTTPConnection):

    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self.stored_headers = []

    def putheader(self, header, value):
        self.stored_headers.append((header, value))
        httplib.HTTPConnection.putheader(self, header, value)


class HTTPCaptureHeaderHandler(urllib2.AbstractHTTPHandler):

    def http_open(self, req):
        return self.do_open(CustomHTTPConnection, req)

    http_request = urllib2.AbstractHTTPHandler.do_request_

    def do_open(self, http_class, req):
        # All code here lifted directly from the python library
        host = req.get_host()
        if not host:
            raise urllib2.URLError('no host given')

        h = http_class(host) # will parse host:port
        h.set_debuglevel(self._debuglevel)

        headers = dict(req.headers)
        headers.update(req.unredirected_hdrs)
        headers["Connection"] = "close"
        headers = dict(
            (name.title(), val) for name, val in headers.items())
        try:
            h.request(req.get_method(), req.get_selector(), req.data, headers)
            r = h.getresponse()
        except socket.error, err: # XXX what error?
            raise urllib2.URLError(err)
        r.recv = r.read
        fp = socket._fileobject(r, close=True)

        resp = urllib2.addinfourl(fp, r.msg, req.get_full_url())
        resp.code = r.status
        resp.msg = r.reason

        # This is the line we're adding
        req.all_sent_headers = h.stored_headers
        return resp

my_handler = HTTPCaptureHeaderHandler()
opener = urllib2.OpenerDirector()
opener.add_handler(my_handler)
req = urllib2.Request(url='http://www.google.com')

resp = opener.open(req)

print req.all_sent_headers

shows: [('Accept-Encoding', 'identity'), ('Host', 'www.google.com'), ('Connection', 'close'), ('User-Agent', 'Python-urllib/2.5')]
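
For comparison, here is a rough sketch of the same idea on Python 3, where less library code needs to be copied. It assumes urllib.request and http.client (the renamed urllib2 and httplib); RecordingConnection, HeaderCaptureHandler, QuietHandler and the local test server are hypothetical names used only for illustration.

```python
import http.client
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RecordingConnection(http.client.HTTPConnection):
    """HTTPConnection that remembers every header passed to putheader()."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.sent_headers = []

    def putheader(self, header, *values):
        # Values may be str, bytes or int; turn them into text for logging.
        text = ", ".join(
            v.decode("latin-1") if isinstance(v, bytes) else str(v) for v in values
        )
        self.sent_headers.append((header, text))
        super().putheader(header, *values)


class HeaderCaptureHandler(urllib.request.HTTPHandler):
    """Attaches the headers that were actually sent to the response object."""

    def http_open(self, req):
        created = []

        def factory(*args, **kwargs):
            # do_open() calls this to build the connection for this request,
            # so every request gets its own fresh header list.
            conn = RecordingConnection(*args, **kwargs)
            created.append(conn)
            return conn

        response = self.do_open(factory, req)
        response.request_headers = created[0].sent_headers
        return response


class QuietHandler(BaseHTTPRequestHandler):
    """Hypothetical local test server so the example needs no internet."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass


server = HTTPServer(("127.0.0.1", 0), QuietHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

opener = urllib.request.build_opener(
    HeaderCaptureHandler, urllib.request.ProxyHandler({})
)
with opener.open("http://127.0.0.1:%d/" % server.server_port) as response:
    response.read()
    sent = response.request_headers

server.shutdown()
server.server_close()
print(sent)
```

Because a new connection object is created for each request, calling this in a loop should not accumulate headers from earlier requests.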

How about something like this:

import urllib2
import httplib

old_putheader = httplib.HTTPConnection.putheader
def putheader(self, header, value):
    print header, value
    old_putheader(self, header, value)
httplib.HTTPConnection.putheader = putheader

urllib2.urlopen('http://www.google.com')

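If you want to see the literal HTTP request that is sent out, and therefore see every last header exactly as it is represented on the wire, you can tell urllib2 to use your own version of an HTTPHandler that prints (or saves) the outgoing HTTP request.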

import httplib, urllib2

class MyHTTPConnection(httplib.HTTPConnection):
    def send(self, s):
        print s  # or save them, or whatever!
        httplib.HTTPConnection.send(self, s)

class MyHTTPHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        return self.do_open(MyHTTPConnection, req)

opener = urllib2.build_opener(MyHTTPHandler)
response = opener.open('http://www.google.com/')

The result of running this code is:

GET / HTTP/1.1
Accept-Encoding: identity
Host: www.google.com
Connection: close
User-Agent: Python-urllib/2.6

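It seems to me that you are looking for the headers of the response object, which include Connection: close and so on. These headers live in the object returned by urlopen. Getting at them is easy enough: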

from urllib2 import urlopen
req = urlopen("http://www.google.com")
print req.headers.headers

req.headers is an instance of httplib.HTTPMessage.

A low-level solution:

import httplib

class HTTPConnection2(httplib.HTTPConnection):
    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self._request_headers = []
        self._request_header = None

    def putheader(self, header, value):
        self._request_headers.append((header, value))
        httplib.HTTPConnection.putheader(self, header, value)

    def send(self, s):
        self._request_header = s
        httplib.HTTPConnection.send(self, s)

    def getresponse(self, *args, **kwargs):
        response = httplib.HTTPConnection.getresponse(self, *args, **kwargs)
        response.request_headers = self._request_headers
        response.request_header = self._request_header
        return response

Example:

conn = HTTPConnection2("www.python.org")
conn.request("GET", "/index.html", headers={
    "User-agent": "test",
    "Referer": "/",
})
response = conn.getresponse()
response.status, response.reason:

1: 200 OK

response.request_headers:

[('Host', 'www.python.org'), ('Accept-Encoding', 'identity'), ('Referer', '/'), ('User-agent', 'test')]

response.request_header:

GET /index.html HTTP/1.1
Host: www.python.org
Accept-Encoding: identity
Referer: /
User-agent: test
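
The send() hook used above can also be tried without touching the network. The following is a minimal sketch, assuming Python 3 (http.client instead of httplib); DryRunConnection is a hypothetical class that records the outgoing bytes and deliberately never sends them, so no response can be read from it.

```python
import http.client


class DryRunConnection(http.client.HTTPConnection):
    """Records what would be written to the socket instead of sending it."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.captured = []

    def send(self, data):
        # The base class would open the socket here; we only keep the bytes.
        self.captured.append(data)


# The host name is a placeholder; nothing is resolved or connected.
conn = DryRunConnection("example.invalid")
conn.request("GET", "/index.html", headers={
    "User-agent": "test",
    "Referer": "/",
})

raw = b"".join(conn.captured).decode("iso-8859-1")
print(raw)
```

This shows the request line and the complete header block exactly as http.client would put it on the wire, including the Host and Accept-Encoding headers it adds on its own.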

Another solution, which uses the same idea but does not copy code from the standard library:

import httplib
import urllib2


class HTTPConnection2(httplib.HTTPConnection):
    """
    Like httplib.HTTPConnection but stores the request headers.
    Used in HTTPConnection3(), see below.
    """
    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self.request_headers = []
        self.request_header = ""

    def putheader(self, header, value):
        self.request_headers.append((header, value))
        httplib.HTTPConnection.putheader(self, header, value)

    def send(self, s):
        self.request_header = s
        httplib.HTTPConnection.send(self, s)


class HTTPConnection3(object):
    """
    Wrapper around HTTPConnection2
    Used in HTTPHandler2(), see below.
    """
    def __call__(self, *args, **kwargs):
        """
        instance made in urllib2.HTTPHandler.do_open()
        """
        self._conn = HTTPConnection2(*args, **kwargs)
        self.request_headers = self._conn.request_headers
        self.request_header = self._conn.request_header
        return self

    def __getattribute__(self, name):
        """
        Redirect attribute access to the local HTTPConnection() instance.
        """
        if name == "_conn":
            return object.__getattribute__(self, name)
        else:
            return getattr(self._conn, name)


class HTTPHandler2(urllib2.HTTPHandler):
    """
    A HTTPHandler which stores the request headers.
    Used HTTPConnection3, see above.

    >>> opener = urllib2.build_opener(HTTPHandler2)
    >>> opener.addheaders = [("User-agent", "Python test")]
    >>> response = opener.open('http://www.python.org/')

    Get the request headers as a list build with HTTPConnection.putheader():
    >>> response.request_headers
    [('Accept-Encoding', 'identity'), ('Host', 'www.python.org'), ('Connection', 'close'), ('User-Agent', 'Python test')]

    >>> response.request_header
    'GET / HTTP/1.1\\r\\nAccept-Encoding: identity\\r\\nHost: www.python.org\\r\\nConnection: close\\r\\nUser-Agent: Python test\\r\\n\\r\\n'
    """
    def http_open(self, req):
        conn_instance = HTTPConnection3()
        response = self.do_open(conn_instance, req)
        response.request_headers = conn_instance.request_headers
        response.request_header = conn_instance.request_header
        return response

Edit: updated the source code.

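I need to log them from within my Python program, so WinPcap is not going to help me. Thanks though.

Yes, it will. Have you read what it is or how to use it? It is used with the Wireshark program, which shows you the analyzed output of the packets and can log them. The packets contain the headers; I thought that was obvious. You can call or incorporate WinPcap in your application.

WinPcap is for Windows. My application runs on all platforms. It is also too much overhead. Thanks for the suggestion though.

This is very close to what I need. The only problem is that when I call it in a loop, it keeps appending duplicate headers.

Justus, this is so close... If you have any other ideas, could you update your answer?

I don't understand what you mean by "in a loop". However, given that this requires so much hacking, I wonder why you need so much logging. You might be better off using an HTTP proxy, letting it do all the logging, and having urllib talk to it.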