
Python, scrapy: wrong UTF-8 characters written to a file from a scraped HTML page with charset iso-8859-1


I want to scrape a web page with charset iso-8859-1 using Scrapy on Python 2.7. The text I am interested in on the page is: tempête

Scrapy returns the response as UTF8 unicode, with the character correctly encoded:

>>> response
u'temp\xc3\xaate'
Now I want to write the word tempête to a file, so I do the following:

>>> import codecs
>>> file = codecs.open('test', 'a', encoding='utf-8')
>>> file.write(response) #response is the above var
When I open the file, the resulting text is tempÃªte. Python does not seem to detect the correct encoding: it cannot read the two-byte encoded character and treats it as two one-byte characters.
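
For illustration, a short Python 2.7 shell sketch (not part of the original question) of what the UTF-8 file writer actually stores: the Unicode string already holds the UTF-8 byte values \xc3 \xaa as separate characters, so encoding it a second time double-encodes them.

>>> # what codecs.open(..., encoding='utf-8') ends up writing
>>> response = u'temp\xc3\xaate'
>>> response.encode('utf-8')
'temp\xc3\x83\xc2\xaate'
>>> # read back as UTF-8, those four middle bytes render as two characters
>>> print 'temp\xc3\x83\xc2\xaate'.decode('utf-8')
tempÃªte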


How do I handle this simple use case?

In your example, response is a (decoded) Unicode string with \xc3\xaa inside it, so something went wrong at the Scrapy encoding-detection level.

\xc3\xaa is the character ê encoded as UTF-8, so you should only ever see those bytes in an (encoded) non-Unicode/str string (i.e. in Python 2).

Python 2.7 shell session:

>>> # what your input should look like
>>> tempete = u'tempête'
>>> tempete
u'temp\xeate'

>>> # UTF-8 encoded
>>> tempete.encode('utf-8')
'temp\xc3\xaate'
>>>
>>> # latin1 encoded
>>> tempete.encode('iso-8859-1')
'temp\xeate'
>>> 

>>> # back to your sample
>>> s = u'temp\xc3\xaate'
>>> print s
tempête
>>>
>>> # if you use a non-Unicode string with those characters...
>>> s_raw = 'temp\xc3\xaate'
>>> s_raw.decode('utf-8')
u'temp\xeate'
>>> 
>>> # ... decoding from UTF-8 works
>>> print s_raw.decode('utf-8')
tempête
>>> 
Interpreting the page as iso-8859-1 encoded is what went wrong on the Scrapy side.

You can force the encoding by rebuilding the response from response.body:

>>> import scrapy.http
>>> hr1 = scrapy.http.HtmlResponse(url='http://www.example', body='<html><body>temp\xc3\xaate</body></html>', encoding='latin1')
>>> hr1.body_as_unicode()
u'<html><body>temp\xc3\xaate</body></html>'
>>> hr2 = scrapy.http.HtmlResponse(url='http://www.example', body='<html><body>temp\xc3\xaate</body></html>', encoding='utf-8')
>>> hr2.body_as_unicode()
u'<html><body>temp\xeate</body></html>'
>>> 

Then use newresponse instead of response.
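
For concreteness, a rough sketch of how the rebuilt response could be used inside a spider callback (the spider class, its name and start_urls are hypothetical placeholders, not from the original question; only HtmlResponse and body_as_unicode() from the answer above are relied on):

import scrapy
from scrapy.http import HtmlResponse

class StormSpider(scrapy.Spider):
    # Hypothetical spider -- name and start_urls are placeholders
    name = 'storm'
    start_urls = ['http://www.example']

    def parse(self, response):
        # Rebuild the response, forcing UTF-8 instead of the
        # wrongly detected iso-8859-1
        newresponse = HtmlResponse(url=response.url,
                                   body=response.body,
                                   encoding='utf-8')
        # work with newresponse from here on
        print newresponse.body_as_unicode()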
You need to encode your response to iso-8859-1 and then decode it from utf-8 before writing it to your file opened with utf-8 encoding:

response = u'temp\xc3\xaate'
r1 = response.encode('iso-8859-1')
r2 = r1.decode('utf-8')
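
Tying this back to the codecs.open() call from the question, a minimal sketch (the file name 'test' is simply the one used above):

# -*- coding: utf-8 -*-
import codecs

response = u'temp\xc3\xaate'                            # the mis-decoded string from Scrapy
fixed = response.encode('iso-8859-1').decode('utf-8')   # u'temp\xeate'

f = codecs.open('test', 'a', encoding='utf-8')
f.write(fixed)                                          # the file now contains: tempête
f.close()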

Interesting read:

Your response is not "UTF8 unicode", it's unicode. Scrapy wrongly interpreted the UTF-8 content as iso-8859-1. Can you share a bit more of your scrapy code?