Python, Scrapy: wrong UTF-8 characters written to file from a scraped HTML page with charset iso-8859-1


python python-2.7 utf-8 character-encoding scrapy


I want to use Scrapy on Python 2.7 to scrape a web page whose charset is iso-8859-1. The text I am interested in on that page is:

tempête

Scrapy returns the response as a unicode string, with the characters seemingly encoded correctly as UTF-8:

>>> response
u'temp\xc3\xaate'

Now I want to write the word tempête to a file, so I do the following:

>>> import codecs
>>> file = codecs.open('test', 'a', encoding='utf-8')
>>> file.write(response) #response is the above var

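When I open the file, the resulting text is tempÃªte. Python does not seem to detect the proper encoding: instead of reading one character encoded on two bytes, it treats it as two characters encoded on one byte each.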

How can I handle this simple use case?

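In your example, response is an (already decoded) Unicode string that contains \xc3\xaa, which means something went wrong at Scrapy's encoding detection level. \xc3\xaa is the character ê encoded as UTF-8, so you should only ever see those two bytes in an (encoded) non-Unicode string, that is, a str in Python 2. A correctly decoded Unicode string would contain the single code point \xea instead.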

Python 2.7 shell session:

>>> # what your input should look like
>>> tempete = u'tempête'
>>> tempete
u'temp\xeate'

>>> # UTF-8 encoded
>>> tempete.encode('utf-8')
'temp\xc3\xaate'
>>>
>>> # latin1 encoded
>>> tempete.encode('iso-8859-1')
'temp\xeate'
>>> 

>>> # back to your sample
>>> s = u'temp\xc3\xaate'
>>> print s
tempÃªte
>>>
>>> # if you use a non-Unicode string with those characters...
>>> s_raw = 'temp\xc3\xaate'
>>> s_raw.decode('utf-8')
u'temp\xeate'
>>> 
>>> # ... decoding from UTF-8 works
>>> print s_raw.decode('utf-8')
tempête
>>>
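
If all you have is the already mis-decoded Unicode string, one possible workaround is to reverse the wrong decode and redo it as UTF-8. This is only a sketch, not part of the session above, and it assumes the bytes really were UTF-8 but got decoded as iso-8859-1. The helper name fix_mojibake is hypothetical.

```python
from __future__ import print_function


def fix_mojibake(text):
    """Undo a wrong iso-8859-1 decode of UTF-8 bytes (hypothetical helper)."""
    try:
        # iso-8859-1 maps code points 0-255 back to the original bytes one-to-one
        raw = text.encode('iso-8859-1')
        # Redo the decode with the encoding the bytes were actually in
        return raw.decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake of this particular kind: return the text unchanged
        return text


broken = u'temp\xc3\xaate'   # what the question's response contains
fixed = fix_mojibake(broken)
print(repr(fixed))
```

The try/except makes the helper leave already-correct strings alone, but repairing the encoding at the source, as described next, is the cleaner fix.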

The problem is that Scrapy interprets the page as iso-8859-1 encoded, while the bytes it received are apparently UTF-8.

You can force the encoding by rebuilding the response from response.body:

>>> import scrapy.http
>>> hr1 = scrapy.http.HtmlResponse(url='http://www.example', body='<html><body>temp\xc3\xaate</body></html>', encoding='latin1')
>>> hr1.body_as_unicode()
u'<html><body>temp\xc3\xaate</body></html>'
>>> hr2 = scrapy.http.HtmlResponse(url='http://www.example', body='<html><body>temp\xc3\xaate</body></html>', encoding='utf-8')
>>> hr2.body_as_unicode()
u'<html><body>temp\xeate</body></html>'
>>>

Then use the rebuilt response (newresponse, created with encoding='utf-8' in the same way as hr2) instead of response in your example.
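
The whole flow can also be sketched without Scrapy. The snippet below is a minimal illustration under stated assumptions: body stands in for response.body and is assumed to hold raw UTF-8 bytes, and out_path is a throwaway temporary file invented for the demonstration. It decodes the bytes explicitly as UTF-8, writes the text with io.open (which behaves the same on Python 2.7 and Python 3), and reads the file back as bytes to check what landed on disk.

```python
import io
import os
import tempfile

# Stand-in for response.body: the raw bytes the server sent (assumed UTF-8)
body = b'temp\xc3\xaate'

# Decode with the actual encoding rather than the declared iso-8859-1
text = body.decode('utf-8')

# Hypothetical output location, used only for this demonstration
out_path = os.path.join(tempfile.mkdtemp(), 'test')

# io.open encodes the Unicode text to UTF-8 when writing
with io.open(out_path, 'a', encoding='utf-8') as f:
    f.write(text)

# Read the file back in binary mode to inspect the bytes on disk
with io.open(out_path, 'rb') as f:
    written = f.read()
```

If the decode step is right, written holds the same bytes as body, so an editor that opens the file as UTF-8 should show tempête.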