Python: data returned from a urllib query is decoded incorrectly

python python-3.x utf-8 character-encoding

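I am scraping some data from Google Images and noticed that letters such as "î" are decoded incorrectly. In this case, "î" becomes "Ã®". The data from my Google query is stored in an object with the following format: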

{"key":"value"}

However, the values of the dictionary can contain non-ASCII characters, for example:

{"key":"File:Blue tit (Cyanistes caeruleus), Parc du Rouge-Cloître, Brussels ( 32781868883).jpg"}

When I receive the data, it has this form:

{"key":"File:Blue tit (Cyanistes caeruleus), Parc du Rouge-Clo\xc3\xaetre, Brussels ( 32781868883).jpg"}

So when I try to convert it to bytes and decode it with:

decoded_obj = bytes(raw_obj, 'utf-8').decode('unicode_escape')

I get this output:

{"key":"File:Blue tit (Cyanistes caeruleus), Parc du Rouge-CloÃ®tre, Brussels ( 32781868883).jpg"}

The scraper code looks like this:

import urllib.request
import json

url = 'https://www.google.com/search?q=Blue+tit+(Cyanistes+caeruleus),+Parc+du+Rouge-Clo%C3%AEtre,+Brussels+(32781868883).jpg&source=lnms&tbm=isch&sa=X&ved=0ahUKEwiE8866stfjAhWBolwKHQ1YCdQQ_AUIESgB&biw=1920&bih=937'
headers = {}
headers['User-Agent'] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)
data = str(response.read())  # str() of the raw bytes, without an encoding

# Locate the JSON object that follows the metadata marker.
start_line = data.find('class="rg_meta notranslate">')
start_obj = data.find('{', start_line + 1)
end_obj = data.find('</div>', start_obj + 1)  # end marker assumed to be the closing div tag
raw_obj = str(data[start_obj:end_obj])
decoded_obj = bytes(raw_obj, 'utf-8').decode('unicode_escape')
final_obj = json.loads(decoded_obj)
print(final_obj)

The response data consists of UTF-8 encoded bytes:

>>> response = urllib.request.urlopen(request)
>>> res = response.read()
>>> type(res)
<class 'bytes'>
>>> response.headers
<http.client.HTTPMessage object at 0x7ff6ea74ba90>
>>> response.headers['Content-type']                                                                                           
'text/html; charset=UTF-8'
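Since the header names the charset, the body can be decoded with it rather than with a hard-coded guess. The following is a small sketch of that idea; the helper names charset_from_content_type and decode_body are hypothetical, and the sample body stands in for response.read() so that the snippet runs offline.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, lowercased."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset returns the fallback when no charset parameter is present.
    return msg.get_content_charset(default)


def decode_body(raw, content_type):
    """Decode response bytes using the charset declared in the header."""
    return raw.decode(charset_from_content_type(content_type))


# Stand-in for response.read() and response.headers['Content-type'].
body = "Parc du Rouge-Cloître".encode("utf-8")
text = decode_body(body, "text/html; charset=UTF-8")
print(text)
```

With a live response, response.headers.get_content_charset() should give the same value directly, because the headers object is an email-style message.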

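So the bytes should be decoded as UTF-8, for example with response.read().decode('utf-8'). Once this is done, data is a str, and no further decoding or encoding (or str() or bytes() calls) is needed.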
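Put together, the extraction step could look roughly like the sketch below once the body has been decoded. It assumes the metadata sits in a div with the class shown in the question and ends at the closing div tag; the function name extract_meta and the sample HTML are made up for illustration, and the real page markup may differ.

```python
import json


def extract_meta(html, marker='class="rg_meta notranslate">', end_tag="</div>"):
    """Parse the first JSON object that follows the marker in decoded HTML."""
    start_line = html.find(marker)
    if start_line == -1:
        raise ValueError("marker not found")
    start_obj = html.find("{", start_line)
    if start_obj == -1:
        raise ValueError("no JSON object after marker")
    end_obj = html.find(end_tag, start_obj)
    if end_obj == -1:
        raise ValueError("end tag not found")
    # The text is already a str, so it goes straight to json.loads.
    return json.loads(html[start_obj:end_obj])


# Sample bytes standing in for response.read(); decode once, then work with str.
sample_bytes = (
    '<div class="rg_meta notranslate">'
    '{"key":"File:Blue tit (Cyanistes caeruleus), Parc du Rouge-Cloître, '
    'Brussels ( 32781868883).jpg"}'
    "</div>"
).encode("utf-8")

data = sample_bytes.decode("utf-8")
final_obj = extract_meta(data)
print(final_obj["key"])
```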

In general, calling str on a bytes instance is a mistake unless the appropriate encoding is provided:

>>> s = 'spam'
>>> bs = s.encode('utf-8')
>>> str(bs)
"b'spam'"   # Now 'b' is inside the string
>>> 

>>> str(bs, encoding='utf-8')
'spam'

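So your string contains hex escape sequences. The question then is how to convert those hex values back into Unicode; in your case, because of the i with circumflex, the decoding should go through latin-1 rather than stop at unicode_escape.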
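One way to read this suggestion is as a two-step repair: unicode_escape turns each escape into a code point below 256, latin-1 maps those code points back to the original bytes, and UTF-8 then decodes the bytes properly. The sketch below assumes the input is pure ASCII text with \xNN escapes, as produced by str() on bytes; the function name repair_escaped_utf8 is hypothetical.

```python
def repair_escaped_utf8(escaped):
    """Turn text containing literal \\xNN escapes of UTF-8 bytes into proper text."""
    # Step 1: interpret the escapes; each \xNN becomes one code point below 256.
    as_code_points = escaped.encode("ascii").decode("unicode_escape")
    # Step 2: latin-1 maps code points 0-255 to the same byte values,
    # which recovers the original UTF-8 bytes for a correct decode.
    return as_code_points.encode("latin-1").decode("utf-8")


escaped = "Parc du Rouge-Clo\\xc3\\xaetre"   # literal backslashes, as in the received data
repaired = repair_escaped_utf8(escaped)
print(repaired)
```

This only salvages text that has already been damaged by str(); decoding the response bytes as UTF-8 in the first place avoids the round trip.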
