How do I convert a unicode string containing cp1252 characters to UTF-8 using Python?

python unicode encoding utf-8

I'm getting text through an API that returns strings containing a Windows-encoded apostrophe (\x92):

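>>> title = u'There\x92s thirty days in June'
>>> title
u'There\x92s thirty days in June'
>>> print title
Theres thirty days in June
>>> type(title)
<type 'unicode'>

I'm trying to convert this string to UTF-8 so that it returns: "There’s thirty days in June"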

When I try to decode or encode this unicode string, it throws an error:

>>> title.decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 5: ordinal not in range(128)

>>> title.encode("cp1252").decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\x92' in position 5: character maps to <undefined>


If I initialize the string as a plain byte string and then decode it, it works:

>>> title = 'There\x92s thirty days in June'
>>> type(title)
<type 'str'>
>>> print title.decode('cp1252')
There’s thirty days in June


My question is: how do I convert the unicode string I'm getting into a plain byte string so that I can decode it?

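It looks like your string was decoded with latin1 (since it is of type unicode).

1. To convert it back to the bytes it originally was, encode it using that same encoding (latin1).
2. Then, to get text (unicode) back, decode those bytes using the proper codec (cp1252).
3. Finally, if you want UTF-8 bytes, encode the result using the utf-8 codec.

In code: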

>>> title = u'There\x92s thirty days in June'
>>> title.encode('latin1')
'There\x92s thirty days in June'
>>> title.encode('latin1').decode('cp1252')
u'There\u2019s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252'))
There’s thirty days in June
>>> title.encode('latin1').decode('cp1252').encode('UTF-8')
'There\xe2\x80\x99s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252').encode('UTF-8'))
There’s thirty days in June

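Depending on whether the API you pass the result to takes text (unicode) or bytes, the final UTF-8 encoding step (3.) may not be necessary.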
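The encode/decode round trip above can be wrapped in a small helper. This is only a sketch: the name `repair_cp1252` is made up for illustration, and it assumes the incoming text really is cp1252 bytes that were mis-decoded as latin1, as in the question. It is written so that it also runs on Python 3, where `u'...'` literals are still accepted.

```python
# -*- coding: utf-8 -*-

def repair_cp1252(text):
    """Undo a latin1 mis-decode of cp1252 bytes (hypothetical helper)."""
    raw = text.encode('latin1')   # step 1: recover the original bytes
    return raw.decode('cp1252')   # step 2: decode with the intended codec


title = u'There\x92s thirty days in June'
fixed = repair_cp1252(title)        # u'There\u2019s thirty days in June'
utf8_bytes = fixed.encode('utf-8')  # step 3: only needed if bytes are required
```

Two caveats if you use something like this on arbitrary input: `encode('latin1')` raises UnicodeEncodeError for any character above U+00FF, and Python's cp1252 codec leaves the bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined, so `decode('cp1252')` raises UnicodeDecodeError on those.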
u'\x92' in a Unicode string is U+0092, the C1 control character PRIVATE USE TWO. '\x92' in a cp1252-encoded byte string is RIGHT SINGLE QUOTATION MARK. If you are receiving Unicode, then the API decoded the bytes to Unicode incorrectly; had it decoded them correctly, the character would be u'\u2019'.
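If you want to check this for yourself, the standard `unicodedata` module can show what each code point is. A minimal sketch, with variable names chosen only for illustration; it runs on Python 3 as well as Python 2.7:

```python
import unicodedata

wrong = u'\x92'    # what the API handed back
right = u'\u2019'  # what a correct cp1252 decode produces

wrong_category = unicodedata.category(wrong)  # 'Cc' = control character
right_name = unicodedata.name(right)          # 'RIGHT SINGLE QUOTATION MARK'
decoded = b'\x92'.decode('cp1252')            # byte 0x92 interpreted as cp1252
```

Here `decoded == right` holds, which is why decoding the raw bytes as cp1252 rather than latin1 gives the expected apostrophe.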