Python HTMLParser or urllib2 unicode problem

python unicode

I am trying to use HTMLParser and urllib2 to fetch an image file:

content = urllib2.urlopen( imgurl.encode('utf-8') ).read()
try:
    p = MyHTMLParser(  )
    p.feed( content )
    p.download_file( )
    p.close()
except Exception,e:
    print e

MyHTMLParser:

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)        
        self.url=""
        self.outfile = "some.png"

    def download_file(self):
        urllib.urlretrieve( self.url, self.outfile )

    def handle_starttag(self, tag, attrs):
        if tag == "a":
           # after some manipulation here, self.url will have a img url
           self.url = "http://somewhere.com/Fondue%C3%A0.png"

When I run the script, I get:

Traceback (most recent call last):
File "test.py", line 59, in <module>
p.feed( data )
File "/usr/lib/python2.7/HTMLParser.py", line 114, in feed
self.goahead(0)
File "/usr/lib/python2.7/HTMLParser.py", line 158, in goahead
k = self.parse_starttag(i)
File "/usr/lib/python2.7/HTMLParser.py", line 305, in parse_starttag
attrvalue = self.unescape(attrvalue)
File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 56: ordinal not in range(128)

Following suggestions I found, I used the .encode('utf-8') method, but it still gives me the error. How can I fix this? Thanks.
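The last line of the traceback reports an implicit ASCII decode of byte 0xc3. That byte is the lead byte of many two-byte UTF-8 sequences (for example, "à" is 0xC3 0xA0), which suggests the page content is a UTF-8 byte string that Python 2 tried to combine with a unicode string. The following is only a minimal sketch of that failure in isolation, and of the explicit decode that avoids it. The byte string `raw` and the helper `try_decode` are made up for illustration and are not taken from the real page or from HTMLParser.

```python
# Hypothetical page fragment: UTF-8 bytes containing "à" (0xC3 0xA0).
raw = b'<a href="Fondue\xc3\xa0.png">'


def try_decode(data, encoding):
    """Return the decoded text, or the UnicodeDecodeError if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc


# Decoding as ASCII fails on the first non-ASCII byte (0xc3).
ascii_result = try_decode(raw, "ascii")

# Decoding with the encoding the bytes were actually written in succeeds.
utf8_result = try_decode(raw, "utf-8")

print(repr(ascii_result))
print(repr(utf8_result))
```

The sketch runs on both Python 2.7 and Python 3. The point is that the fix is to decode the downloaded bytes, not to encode the URL.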

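Replace

content = urllib2.urlopen( url.encode('utf-8') ).read()

with

content = urllib2.urlopen(url).read().decode('utf-8')

to decode the response to unicode.

The error message should come with a file name and line number indicating the line where the error occurs. Which line is it?

@Sammusmann, I've added the actual traceback. Thanks.

HTMLParser parses HTML. That's an actual image. Why not just download the file?

@Blender, actually the img url only exists after I do some string manipulation at runtime, so I don't know the exact url at the start.

@dorothy: but the question remains: does url point to an HTML page or to a PNG image?

@dorothy: utf-8 is not the only character encoding that can be used in HTML documents. The encoding can be specified in the HTTP headers (e.g. Content-Type: text/html; charset=utf-8) or in the HTML itself.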
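Because the page is not guaranteed to be UTF-8, the hard-coded 'utf-8' could be replaced by the charset announced in the Content-Type header, with UTF-8 only as a fallback. The following is a rough sketch under that assumption. The helper names `charset_from_content_type` and `decode_body` are hypothetical, and the sketch ignores a charset declared inside the HTML itself.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or default."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when the header has no charset parameter.
    return msg.get_content_charset() or default


def decode_body(raw, content_type=None, default="utf-8"):
    """Decode the raw response bytes using the charset from the header."""
    return raw.decode(charset_from_content_type(content_type, default))


# Example with made-up bytes: the same text sent in two different encodings.
text_utf8 = decode_body(b"Fondue\xc3\xa0", "text/html; charset=utf-8")
text_latin1 = decode_body(b"Fondue\xe0", "text/html; charset=ISO-8859-1")
```

With urllib2 on Python 2.7, the header value would come from the response object, for example response.info().getheader('Content-Type'), and the decoded result is what gets passed to p.feed().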