Python Unicode in general

python python-2.7 unicode

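I'm having trouble understanding unicode in Python 2.7.2, so I tried some tests in IDLE. Two things are marked "not sure". Please tell me why they fail. For the other items, please tell me whether my comments are accurate.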

>>> s
'Don\x92t '  # s is a string
>>> u
u'Don\u2019t '  # u is a unicode object
>>> type(u)     # confirm u is unicode
<type 'unicode'>
>>> type(s)     # confirm s is string
<type 'str'>
>>> type(s) == 'str' # wrong way to test
False
>>> isinstance(s, str)  # right way to test
True
>>> print s
Don’t       # works because idle can handle strings
>>> print u
Don’t       # works because idle can handle unicode
>>> open('9', 'w').write(s.encode('utf8')) #encode takes unicode, but s is a string,
                                            # so this fails
Traceback (most recent call last):
  File "<pyshell#28>", line 1, in <module>
    open('9', 'w').write(s.encode('utf8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 3: ordinal not in range(128)
>>> open('9', 'w').write(s) # write can write strings
>>> open('9', 'w').write(u) # write can't write unicode

Traceback (most recent call last):
  File "<pyshell#30>", line 1, in <module>
    open('9', 'w').write(u)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 3: ordinal not in range(128)
>>> open('9', 'w').write(u.encode('utf8'))  # encode turns unicode to string, which write can handle
>>> open('9', 'w').write(s.decode('utf8'))  # decode turns string to unicode, which write can't handle

Traceback (most recent call last):
  File "<pyshell#32>", line 1, in <module>
    open('9', 'w').write(s.decode('utf8'))
  File "C:\program files\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 3: invalid start byte
>>> e = '{}, {}'.format(s, u) # fails because ''.format is string, while u is unicode

Traceback (most recent call last):
  File "<pyshell#33>", line 1, in <module>
    e = '{}, {}'.format(s, u)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 3: ordinal not in range(128)
>>> e = '{}, {}'.format(s, u.encode('utf8')) # works because u.encode is a string
>>> e = u'{}, {}'.format(s, u) # not sure

Traceback (most recent call last):
  File "<pyshell#36>", line 1, in <module>
    e = u'{}, {}'.format(s, u)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 3: ordinal not in range(128)
>>> e = u'{}, {}'.format(s.decode('utf8'), u) # not sure

Traceback (most recent call last):
  File "<pyshell#55>", line 1, in <module>
    e = u'{}, {}'.format(s.decode('utf8'), u)
  File "C:\program files\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 3: invalid start byte

>>> e = '\n'.join([s, u]) # wants strings, but u is unicode

Traceback (most recent call last):
  File "<pyshell#37>", line 1, in <module>
    e = '\n'.join([s, u])
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 3: ordinal not in range(128)
>>> e = '\n'.join([s, u.encode('utf8')]) # u.encode is now a string

First of all, s is not a utf-8 encoded string; it is probably a cp1250 encoded string. Decoding it with utf-8 will therefore always fail.
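As a small illustration of that point, the sketch below decodes the same bytes with both codecs. It assumes the byte string really is cp1250, which is only the guess made above; the names raw, text and utf8_ok are made up for the example. It uses b'' and u'' literals so it runs on Python 2.7 and Python 3 alike.

```python
# The byte string from the question, written as an explicit bytes literal.
raw = b'Don\x92t '

# 0x92 is a continuation-range byte, so it cannot start a UTF-8 sequence.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumption: the bytes are cp1250, where 0x92 maps to U+2019.
text = raw.decode('cp1250')

print(utf8_ok)
print(text == u'Don\u2019t ')
```

With the right codec, the decoded value is equal to the unicode object u from the question.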

>>> e = u'{}, {}'.format(s, u) # not sure

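The first "not sure" fails because u'{}, {}' is unicode, so it tries to convert each argument of the format function to a unicode string. Because it does not know what encoding s is in, it assumes s is ascii and tries to decode it as ascii (basically doing s.decode('ascii')). That fails because s is a cp1250 encoded string.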
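The implicit step can be reproduced by hand. The sketch below performs the ascii decode explicitly, records the error, and then shows the same format call succeeding once the bytes are decoded with an explicit codec. The cp1250 codec is again an assumption, and the variable names are illustrative only.

```python
raw = b'Don\x92t '        # the byte string s from the question
text = u'Don\u2019t '     # the unicode object u from the question

# What Python 2 effectively does behind the scenes for a str argument.
try:
    raw.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding explicitly first means no implicit ascii conversion is needed.
combined = u'{}, {}'.format(raw.decode('cp1250'), text)

print(ascii_error is not None)
print(combined == u'Don\u2019t , Don\u2019t ')
```

The recorded error points at position 3, the same byte the traceback in the question complains about.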

>>> e = u'{}, {}'.format(s.decode('utf8'), u) # not sure

The second one fails because you try to decode s as utf-8 but, as said before, it is actually in some other encoding that is not compatible with utf-8.

Python 2 will automatically encode Unicode values, or decode string values, when you mix string and Unicode operations. This is the source of your confusion.

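For example, when you write a Unicode value to a file, Python 2 will try to encode that value to a string. Because no encoding is specified, the default encoding is used, which is ASCII on Python 2. The same applies when a str value is used in a unicode context: Python 2 will decode it with the ASCII codec.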

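However, your sample values contain code points or bytes that cannot be represented as ASCII characters, so the automatic conversion fails. The UnicodeEncodeError or UnicodeDecodeError exceptions you see are the result of that automatic conversion.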
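The encoding direction can be sketched the same way. Below, the ascii encode that Python 2 attempts when a unicode object is written to a plain file is done explicitly, followed by an explicit utf-8 encode that yields bytes a plain file can accept. The variable names are illustrative, and utf-8 is just one possible target codec.

```python
text = u'Don\u2019t '     # the unicode object u from the question

# The implicit conversion, made explicit: U+2019 has no ASCII representation.
try:
    text.encode('ascii')
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

# Choosing the codec yourself produces a byte string without any guessing.
utf8_bytes = text.encode('utf-8')

print(encode_error is not None)
print(utf8_bytes == b'Don\xe2\x80\x99t ')
```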

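Specifically, e = u'{}, {}'.format(s, u) tries to decode s to Unicode in order to insert it into the Unicode u'{}, {}' template string.

To avoid the automatic conversion, you need to use explicit conversions. To do that, you need to know the correct encoding of your byte strings or, when encoding Unicode, what your target codec is.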
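Putting the pieces together, one possible shape for fully explicit conversion is sketched here: decode bytes as soon as they arrive, work only with unicode in between, and let io.open encode on the way out. The source codec (cp1250), the target codec (utf-8) and the file name out.txt are all assumptions chosen for the example.

```python
import io
import os
import tempfile

raw = b'Don\x92t '        # byte string in an assumed cp1250 encoding
text = u'Don\u2019t '     # already unicode

# Decode at the boundary so everything in between is unicode.
decoded = raw.decode('cp1250')
joined = u'\n'.join([decoded, text])

# io.open takes an encoding and accepts unicode on Python 2.7 and Python 3.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')   # hypothetical file name
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(joined)

# Read it back with the same codec to confirm nothing was lost.
with io.open(path, 'r', encoding='utf-8') as handle:
    round_trip = handle.read()

print(round_trip == joined)
```

Because the codec is named at both ends, neither the join nor the write has to fall back on the ASCII default.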