How do I convert a unicode string containing cp1252 characters to UTF-8 using Python?
I'm getting text via an API that returns characters with a Windows-encoded apostrophe (\x92):
>>> title = u'There\x92s thirty days in June'
>>> title
u'There\x92s thirty days in June'
>>> print title
Theres thirty days in June
>>> type(title)
<type 'unicode'>
I'm trying to convert this string to UTF-8 so that it returns: "There’s thirty days in June"
When I try to decode or encode this unicode string, it throws an error:
>>> title.decode('cp1252')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 5: ordinal not in range(128)
>>> title.encode("cp1252").decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\x92' in position 5: character maps to <undefined>
If I initialize the string as a plain byte string (str) and then decode it, it works:
>>> title = 'There\x92s thirty days in June'
>>> type(title)
<type 'str'>
>>> print title.decode('cp1252')
There’s thirty days in June
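My question is: how do I convert the unicode string I'm getting into a plain byte string so that I can decode it?

It seems your string was decoded with latin1 (since it is of type unicode).

1. To convert it back to the bytes it originally was, encode it using that same encoding (latin1).
2. Then, to get text back (unicode), decode those bytes using the proper codec (cp1252).
3. Finally, if you want UTF-8 bytes, encode the text using the utf-8 codec.

>>> title = u'There\x92s thirty days in June'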
>>> title.encode('latin1')
'There\x92s thirty days in June'
>>> title.encode('latin1').decode('cp1252')
u'There\u2019s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252'))
There’s thirty days in June
>>> title.encode('latin1').decode('cp1252').encode('UTF-8')
'There\xe2\x80\x99s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252').encode('UTF-8'))
There’s thirty days in June
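Depending on whether the API takes text (unicode) or bytes, step 3 may not be necessary.

u'\x92' in a Unicode string is a C1 control character (U+0092, PRIVATE USE TWO), whereas '\x92' in a cp1252-encoded byte string is the right single quotation mark. If you are receiving Unicode, then the API decoded the bytes to Unicode incorrectly. Had they been decoded correctly, the character would be u'\u2019'.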
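For readers on Python 3, where every str is already Unicode text, the same round trip can be sketched as below. The helper name repair_cp1252 is hypothetical, and the sketch assumes the whole string was mis-decoded as latin1, so every character fits in a single byte.

```python
def repair_cp1252(text: str) -> str:
    """Undo a latin1 mis-decoding of bytes that were really cp1252.

    Hypothetical helper: assumes every character in `text` is in the
    range U+0000..U+00FF, i.e. the whole string came from latin1.
    """
    raw = text.encode("latin1")   # step 1: recover the original bytes
    return raw.decode("cp1252")   # step 2: decode with the intended codec


title = "There\x92s thirty days in June"
fixed = repair_cp1252(title)
utf8_bytes = fixed.encode("utf-8")  # step 3: only needed if bytes are required

print(fixed)
print(utf8_bytes)
```

If the consumer accepts text rather than bytes, the final encode step can be dropped.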
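The whole-string round trip fails if the text also contains characters above U+00FF (latin1 cannot encode them) or one of the few bytes that Python's cp1252 codec leaves undefined. A more defensive variant, sketched here in Python 3 under the assumption that only the C1 range U+0080..U+009F was mis-decoded, repairs those characters one at a time. The function name repair_c1_controls is hypothetical.

```python
def repair_c1_controls(text: str) -> str:
    """Replace C1 control characters with their cp1252 meaning.

    Hypothetical helper: only characters in U+0080..U+009F are touched,
    so correctly decoded text elsewhere in the string is left alone.
    """
    out = []
    for ch in text:
        if "\x80" <= ch <= "\x9f":
            try:
                ch = ch.encode("latin1").decode("cp1252")
            except UnicodeDecodeError:
                # This byte has no cp1252 mapping in Python's codec;
                # keep the original character unchanged.
                pass
        out.append(ch)
    return "".join(out)


print(repair_c1_controls("There\x92s thirty days in June"))
```

The trade-off is that a string which legitimately contains C1 control characters would be altered, so this is only appropriate when the source is known to have mixed up the two encodings.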