How to follow meta refreshes in Python


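Python's urllib2 follows 3xx redirects to get the final content. Is there a way to make urllib2 (or some other library) follow meta refreshes as well? Or do I need to parse the HTML manually for the refresh meta tags?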

Use BeautifulSoup or lxml to parse the HTML.
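To make the parsing step concrete, here is a minimal sketch that pulls the target URL out of a meta refresh tag. It swaps in the standard library's `html.parser` for BeautifulSoup or lxml so that it runs without extra dependencies, and it is written for Python 3. The names `MetaRefreshFinder` and `find_meta_refresh` are made up for this example.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Records the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "5; url=/next";
        # the "url=" part is optional (a bare delay just reloads the page).
        content = attrs.get("content") or ""
        _, _, rest = content.partition(";")
        key, _, value = rest.partition("=")
        if key.strip().lower() == "url":
            self.target = value.strip().strip("'\"")


def find_meta_refresh(html):
    """Return the refresh target URL found in html, or None."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    finder.close()
    return finder.target
```

The returned value may be a relative URL, so the caller still has to resolve it against the URL of the page it came from.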

OK, it seems no library supports this, so I have been using the following code:

import urllib2
import urlparse
import re

def get_hops(url):
    redirect_re = re.compile('<meta[^>]*?url=(.*?)["\']', re.IGNORECASE)
    hops = []
    while url:
        if url in hops:
            url = None
        else:
            hops.insert(0, url)
            response = urllib2.urlopen(url)
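            if response.geturl() != url:
                # urllib2 followed a 3xx redirect; record where it ended up
                url = response.geturl()
                hops.insert(0, url)
            # check for redirect meta tag
            match = redirect_re.search(response.read())
            if match:
                # resolve relative targets against the page actually fetched
                url = urlparse.urljoin(url, match.groups()[0].strip())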
            else:
                url = None
    return hops


Here is a solution using BeautifulSoup and httplib2 (with certificate-based authentication):

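A similar solution using the requests and lxml libraries. It also does a simple check that the thing being tested is actually HTML (a requirement in my implementation). It can also capture and reuse cookies through a requests session, which is sometimes necessary when redirects plus cookies are used as an anti-scraping mechanism.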

Usage:

s = requests.session()
r = s.get(url)
# test for and follow meta redirects
r = follow_redirections(r, s)
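The definition of `follow_redirections` does not appear alongside the usage snippet. One possible shape for such a helper is sketched below, under these assumptions: it takes a requests-style response (with `.url`, `.text` and `.headers`) and a session (with `.get`), it checks the `Content-Type` header to decide whether the body is HTML, and it uses the standard library's `html.parser` in place of lxml. The `max_hops` parameter and the `_RefreshFinder` class are inventions of this sketch, not part of the original answer.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _RefreshFinder(HTMLParser):
    """Keeps the content attribute of the first meta refresh tag."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == "meta" and self.content is None
                and (attrs.get("http-equiv") or "").lower() == "refresh"):
            self.content = attrs.get("content") or ""


def meta_redirect(html):
    """Return the URL named in a meta refresh tag, or None."""
    finder = _RefreshFinder()
    finder.feed(html)
    finder.close()
    if finder.content is None:
        return None
    _, _, rest = finder.content.partition(";")
    key, _, value = rest.partition("=")
    if key.strip().lower() != "url":
        return None
    return value.strip().strip("'\"") or None


def follow_redirections(response, session, max_hops=10):
    """Follow meta refreshes starting from response; return the last response.

    max_hops is an assumed safety limit so a refresh loop cannot run forever.
    """
    for _ in range(max_hops):
        # Only HTML can carry a meta refresh tag.
        content_type = response.headers.get("Content-Type", "")
        if "html" not in content_type.lower():
            break
        target = meta_redirect(response.text)
        if target is None:
            break
        # Resolve relative targets against the URL of the current page.
        response = session.get(urljoin(response.url, target))
    return response
```

Because every request goes through the same session, cookies set on one hop are sent on the next, which is the point of passing the session in.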

If you don't want to use bs4, you can use lxml like this:

import re
from lxml.html import soupparser

def meta_redirect(content):
    root = soupparser.fromstring(content)
    result_url = root.xpath('//meta[@http-equiv="refresh"]/@content')
    if not result_url:
        return None
    # The content attribute looks like "0; URL=http://example.com/".
    # Split on "url=" case-insensitively and keep everything after it.
    urls = re.split(r'url=', str(result_url[0]), maxsplit=1, flags=re.IGNORECASE)
    return urls[1].strip() if len(urls) >= 2 else None


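Using an HTML parser just to extract the meta refresh tag is overkill, at least for me. I was hoping there was a Python HTTP library that did this automatically.

It is an HTML tag, so you are unlikely to find this functionality in an HTTP library.

Sometimes the meta refresh redirect points to a relative URL. For example, Facebook does this. It would be nice to detect relative URLs and prepend the scheme and host.

@JosephMornin: Adapted. I realize it still doesn't support circular redirects... but there is always something.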
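Both points raised in these comments, relative targets and circular redirects, can be handled in a few lines. The sketch below is written for Python 3 and uses `urllib.parse.urljoin`, which leaves absolute URLs alone and resolves relative ones against a base. The helper name `next_hop` and the `seen` set are assumptions of this example rather than anything from the answers above.

```python
from urllib.parse import urljoin


def next_hop(current_url, target, seen):
    """Resolve target against current_url; return None if already visited.

    seen is a set of absolute URLs fetched so far (updated in place).
    """
    absolute = urljoin(current_url, target.strip())
    if absolute in seen:
        return None  # circular refresh: stop here
    seen.add(absolute)
    return absolute
```

A caller would seed `seen` with the starting URL, call `next_hop` with each refresh target it finds, and stop following as soon as it gets `None` back.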
