Python: UnicodeEncodeError when fetching a URL

python google-app-engine


I am using urlfetch to fetch a URL. When I try to pass the result to the html2text function (which strips out all HTML tags), I get the following message:

UnicodeEncodeError: 'charmap' codec can't encode characters in position  ... character maps to <undefined>

along with this traceback:

File "C:\Python26\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 159-165: character maps to <undefined>


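You need to decode the data you fetched first! Which codec should you use? That depends on the website you fetched.

Once you have a unicode object and encode it with something like some_unicode.encode('utf-8', 'ignore'), I can't imagine how it could raise an error.

OK, here is what you need to do: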

from google.appengine.api import urlfetch

result = urlfetch.fetch('http://google.com')
content_type = result.headers['Content-Type']  # figure out what you just fetched
ctype, charset = content_type.split(';')  # assumes exactly one parameter after the media type
encoding = charset[len(' charset='):]  # get the encoding
print encoding  # e.g. ISO-8859-1
utext = result.content.decode(encoding)  # now you have unicode
text = utext.encode('utf8', 'ignore')  # encode to UTF-8

This isn't very robust, but it should point you in the right direction.
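A slightly more defensive way to pull the charset out of the header might look like the sketch below. It is only an illustration: the helper names charset_from_content_type and decode_body are made up for this example, the UTF-8 fallback is an assumption (the thread only says that you have to guess when nothing is specified), and it ignores charsets declared in an HTML meta tag.

```python
def charset_from_content_type(content_type, default='utf-8'):
    """Return the charset named in a Content-Type header value, or `default`."""
    if not content_type:
        return default
    # Parameters follow the media type, separated by semicolons.
    for param in content_type.split(';')[1:]:
        name, sep, value = param.strip().partition('=')
        if sep and name.strip().lower() == 'charset':
            value = value.strip().strip('"\'')
            if value:
                return value
    return default


def decode_body(raw_bytes, content_type, default='utf-8'):
    """Decode fetched bytes to text; fall back to `default` for unknown codecs."""
    encoding = charset_from_content_type(content_type, default)
    try:
        return raw_bytes.decode(encoding, 'replace')
    except LookupError:
        # The header named a codec that Python does not know.
        return raw_bytes.decode(default, 'replace')
```

With urlfetch this would be called as decode_body(result.content, result.headers.get('Content-Type')), assuming the headers object supports get() like a dictionary.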

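@Joel: The codec you need for decoding is given either in the HTTP header or in an HTML meta tag (or it is not specified at all, in which case you have to guess). Google is a bad example, because the page you get depends on where you live. Please add

content_type = result.headers.getheader('content-type'); print(content_type)

to your code (after

result = urlfetch.fetch(…)

) and tell us the result.

The output is: "windows-1255". I tried switching to html2text(result.content.decode('windows-1255', 'ignore')), but I still get "UnicodeEncodeError: 'charmap' codec can't encode characters in position 2-8: character maps to <undefined>".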
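One possible reading of the traceback, not confirmed anywhere in the thread: the failing frame is in cp1252.py and the call is encode, so the error may come from the decoded text later being encoded to cp1252 (for example when it is printed to a Windows console), not from the decode step. The sketch below reproduces that situation with made-up sample bytes. Text decoded from windows-1255 cannot be encoded to cp1252, while an explicit UTF-8 encode works.

```python
raw = b'\xf9\xec\xe5\xed'            # hypothetical sample bytes in windows-1255
text = raw.decode('windows-1255')    # decoding succeeds and yields Hebrew letters

try:
    text.encode('cp1252')            # Hebrew letters have no cp1252 mapping
    failed = False
except UnicodeEncodeError:
    failed = True

utf8_bytes = text.encode('utf-8')          # UTF-8 can represent these characters
lossy = text.encode('cp1252', 'replace')   # unmappable characters become '?'
```

If this is what is happening, encoding explicitly to UTF-8 before writing or printing the text, as the answer suggests, avoids the implicit cp1252 step.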
