Charset problems when scraping a web page with Python and requests

python encoding character-encoding

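I am trying to download a Chinese-language page (declared as gb2312 in its meta tag). After running the code below and opening the resulting file in gEdit as gb2312, I get garbled symbols such as ×××(ò) where the Chinese characters should be.

Here is the source code of the page in question (the actual site is for educational institutions only).

My code: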

r = requests.post("http://example.com", data=payload, cookies=cookies)
f = open('myfile.txt', 'w')
f.write(r.text.encode('gb2312',errors="ignore"))
f.close()

The response headers:

{'content-length': '6164', 'x-powered-by': 'ASP.NET', 'date': 'Mon, 11 Mar 2013 05:11:24 GMT', 'cache-control': 'private', 'content-type': 'text/html', 'server': 'Microsoft-IIS/6.0'}
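Note that the content-type above is plain text/html with no charset parameter. As far as the requests documentation describes it, r.text falls back to ISO-8859-1 for text/* responses in that situation, which would be enough to produce mojibake for a page whose bytes are really UTF-8 or GB2312. The sketch below is only an illustration under that assumption: charset_from_content_type is a hypothetical helper built on the standard library, and the sample string is made up.

```python
# -*- coding: utf-8 -*-
from email.message import Message


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration; it only parses the header string.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


# The header from the question carries no charset at all.
declared = charset_from_content_type("text/html")

# Made-up sample: UTF-8 bytes wrongly decoded as ISO-8859-1 turn into mojibake,
# but the damage is reversible because ISO-8859-1 maps every byte to a character.
raw = u"汉字".encode("utf-8")
garbled = raw.decode("iso-8859-1")
recovered = garbled.encode("iso-8859-1").decode("utf-8")
```

Encoding the garbled text to gb2312 with errors="ignore", as in the code above, is not reversible in the same way, because characters that gb2312 cannot represent are silently dropped.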

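If I try decode instead of encode, i.e. f.write(r.text.decode('gb2312',errors="ignore")), I get the following error in Python:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 2017-2018: ordinal not in range(128)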
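The error mentions the ascii codec even though gb2312 was requested. In Python 2, r.text is already a unicode object, and calling .decode() on a unicode object first encodes it implicitly with the default ascii codec before decoding. The sketch below spells that hidden step out explicitly so it behaves the same way on Python 3; the sample text is made up.

```python
# -*- coding: utf-8 -*-
text = u"汉字"  # stands in for r.text, which is already decoded text

error = None
try:
    # Explicit version of what Python 2 does implicitly for unicode.decode('gb2312'):
    # encode to ASCII bytes first, then decode those bytes as gb2312.
    text.encode("ascii").decode("gb2312")
except UnicodeEncodeError as exc:
    # The failure happens in the encode step, before gb2312 is ever consulted.
    error = exc
```

Decoding only makes sense on bytes, so the call belongs on the raw response body rather than on text that has already been decoded.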

djc@enrai http $ python
Python 2.7.3 (default, Jun 18 2012, 09:39:59)
[GCC 4.5.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> rsp = urllib.urlopen('https://gist.github.com/anonymous/27663069655db7fd7a19/raw/836a5c55d0f87a2fa5edcc9a14097c945452f520/chinese.html').read()
>>> import chardet
>>> chardet.detect(rsp)
{'confidence': 0.99, 'encoding': 'utf-8'}
>>> rsp.decode('utf-8')
u'\n<HTML><HEAD>(snip)</BODY></HTML>\n'

So, I guess, don't trust the charset header?
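Following the same idea, one option is to work from the raw bytes and decide the encoding yourself instead of relying on what the page declares. The sketch below is not the chardet-based detection used in the session above: it is a simpler, dependency-free stand-in that tries strict decodes in order. decode_body and save_text are hypothetical helper names, and the candidate list (utf-8 first, then gb18030 as a superset of gb2312) is an assumption chosen for this page.

```python
# -*- coding: utf-8 -*-
import io


def decode_body(raw, candidates=("utf-8", "gb18030")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: strict decoding is used as a cheap validity check,
    so order the candidates from most to least restrictive.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 accepts every byte, so this never raises,
    # but the result may be mojibake.
    return raw.decode("latin-1"), "latin-1"


def save_text(path, text, encoding="utf-8"):
    """Write decoded text to disk in a known encoding."""
    # io.open takes an encoding argument on both Python 2 and Python 3.
    with io.open(path, "w", encoding=encoding) as fh:
        fh.write(text)
```

With requests, the bytes to pass in would be r.content rather than r.text, for example text, enc = decode_body(r.content) followed by save_text('myfile.txt', text). The saved file should then be opened in the editor as UTF-8, not gb2312.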