Python lxml.html and Unicode: extracting links

python html unicode utf-8 lxml

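The code below extracts links from a web page and displays them in a browser. It works fine for many UTF-8 encoded pages, but the French Wikipedia page, for example, raises an error.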

# -*- coding: utf-8 -*-

print 'Content-Type: text/html; charset=utf-8\n'
print '''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<title>Show Links</title>
</head>
<body>'''

import urllib2, lxml.html as lh

def load_page(url):
    headers = {'User-Agent' : 'Mozilla/5.0 (compatible; testbot/0.1)'}
    try:
        req = urllib2.Request(url, None, headers)
        response = urllib2.urlopen(req)
        page = response.read()
        return page
    except:
        print '<b>Couldn\'t load:', url, '</b><br>'
        return None

def show_links(page):
    tree = lh.fromstring(page)
    for node in tree.xpath('//a'):
        if 'href' in node.attrib:
            url = node.attrib['href']
            if '#' in url:
                url=url.split('#')[0]
            if '@' not in url and 'javascript' not in url:
                if node.text:
                    linktext = node.text
                else:
                    linktext = '-'
                print '<a href="%s">%s</a><br>' % (url, linktext.encode('utf-8'))

page = load_page('http://fr.wikipedia.org/wiki/%C3%89tats_unis')
show_links(page)

print '''
</body>
</html>
'''

I get the following error:

Traceback (most recent call last):
  File "C:\***\question.py", line 42, in <module>
    show_links(page)
  File "C:\***\question.py", line 39, in show_links
    print '<a href="%s">%s</a><br>' % (url, linktext.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

My system: Python 2.6 (Windows), lxml 2.3.3, Apache server (to display the results)

What am I doing wrong?

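lxml can return bytestrings rather than unicode. It may be best to decode the bytestring to unicode using whatever encoding the page is served with before encoding it as utf-8.

If your text is already utf-8, there is no need to encode or decode at all; just take that operation out.

However, if linktext is of type unicode (as you say it is), then it is a unicode string (each element represents a unicode code point), and encoding it as utf-8 should work perfectly well.

I suspect the problem is that your url string is also a unicode string, and it too needs to be encoded as utf-8 before being substituted into the bytestring.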
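The suggestion to decode with whatever encoding the page is served with could be sketched roughly as follows. This is only an illustration: decode_page is a hypothetical helper, not part of lxml or urllib2, and it assumes the charset is announced in the HTTP Content-Type header and that UTF-8 is an acceptable fallback when it is not.

```python
# -*- coding: utf-8 -*-
from email.message import Message


def decode_page(raw_bytes, content_type, fallback='utf-8'):
    """Decode raw page bytes using the charset named in a Content-Type value.

    decode_page is a hypothetical helper. The fallback charset is an
    assumption; a charset name Python does not know raises LookupError.
    """
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    charset = msg.get_content_charset() or fallback
    # 'replace' keeps going on undecodable bytes instead of raising.
    return raw_bytes.decode(charset, 'replace')
```

In the question's Python 2 code the header value would presumably come from response.info().getheader('Content-Type'); the decoded text could then be handed to lh.fromstring in place of the raw bytes.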

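You need to encode url as well.

The problem may be similar to the following, where the unicode argument makes Python 2 implicitly decode the non-ASCII bytestring as ASCII: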

>>> "%s%s" % (u"", "€ <-non-ascii char in a bytestring")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
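One way to avoid the implicit ASCII decode is to turn both values into text first, format them, and encode exactly once at the end. The sketch below is a minimal illustration, not the asker's final code: to_text and format_link are hypothetical names, and it assumes any bytestring input is UTF-8.

```python
# -*- coding: utf-8 -*-


def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it first if it is a bytestring."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def format_link(url, linktext, encoding='utf-8'):
    """Build one <a> line from text or bytes inputs and return encoded bytes.

    Hypothetical helper: both arguments are normalised to text, so the %
    operation never mixes unicode with a non-ASCII bytestring.
    """
    line = u'<a href="%s">%s</a><br>' % (to_text(url, encoding),
                                         to_text(linktext, encoding))
    # Encode once, at the output boundary.
    return line.encode(encoding)
```

With this in place, the failing line in show_links could become print format_link(url, linktext). Note that the sketch does not HTML-escape either value.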

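Add a line print repr(linktext) just before the failing line, and edit your question to show the result.

Thank you for your quick response! The Wikipedia page is encoded in utf-8. linktext.decode('utf-8').encode('utf-8')) gives me an error as well. By the way: why is linktext [...]?

@renardvolant Your third sentence got lost. Anyway, try my suggestion; I think you will find it solves your problem.

@Marcin: Thank you too. I gave J.F. Sebastian the "accepted answer" because he was 2 minutes faster ;-)

@renardvolant You can express your gratitude through the voting system.

@Marcin: My reputation score is below 15, so I can't upvote your answer. Next time!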
就在失败行之前，编辑您的问题以显示结果。感谢您的快速回复！Wikipedia页面采用utf-8编码。linktext.decode（'utf-8'）。encode（'utf-8'））也给了我一个错误。顺便说一句：为什么链接文本是？@renardvolant你的第三句话丢了。无论如何，试试我的建议，我想你会发现它解决了你的问题。@Marcin：也谢谢你。我给了J.F.Sebastian“接受的答案”，因为他快了2分钟；-）@renardvolant可以通过投票系统表达您的感激之情。@Marcin:我的声誉分数低于15。所以我无法投票给您的答案。下次！谢谢您的快速回复！维基百科页面的编码为utf-8.linktext.decode（'utf-8'）。encode（'utf-8'））也给了我一个错误。顺便说一句：为什么链接文本是？@renardvolant你的第三句话丢了。无论如何，试试我的建议，我想你会发现它解决了你的问题。@Marcin：也谢谢你。我给了J.F.Sebastian“接受的答案”，因为他快了2分钟；-）@renardvolant可以通过投票系统表达您的感激之情。@Marcin:我的声誉分数低于15分。所以我不能推翻你的答案。下次