Python: how to correctly tabulate unicode data
(I am using Python 2.7.) I have this test:
# -*- coding: utf-8 -*-
import binascii
test_cases = [
    'aaaaa',   # Normal bytestring
    'ááááá',   # Normal bytestring, but with extended ascii. Since the file is utf-8 encoded, this is utf-8 encoded
    'ℕℤℚℝℂ',   # Encoded unicode. The editor has encoded this, and it is defined as string, so it is left encoded by python
    u'aaaaa',  # unicode object. The string itself is utf-8 encoded, as defined in the "coding" directive at the top of the file
    u'ááááá',  # unicode object. The string itself is utf-8 encoded, as defined in the "coding" directive at the top of the file
    u'ℕℤℚℝℂ',  # unicode object. The string itself is utf-8 encoded, as defined in the "coding" directive at the top of the file
]
FORMAT = '%-20s -> %2d %-20s %-30s %-30s'
for data in test_cases:
    try:
        hexlified = binascii.hexlify(data)
    except:
        hexlified = None
    print FORMAT % (data, len(data), type(data), hexlified, repr(data))
It produces this output:
aaaaa                ->  5 <type 'str'>         6161616161                     'aaaaa'
ááááá           -> 10 <type 'str'>         c3a1c3a1c3a1c3a1c3a1           '\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1'
ℕℤℚℝℂ      -> 15 <type 'str'>         e28495e284a4e2849ae2849de28482 '\xe2\x84\x95\xe2\x84\xa4\xe2\x84\x9a\xe2\x84\x9d\xe2\x84\x82'
aaaaa                ->  5 <type 'unicode'>     6161616161                     u'aaaaa'
ááááá                ->  5 <type 'unicode'>     None                           u'\xe1\xe1\xe1\xe1\xe1'
ℕℤℚℝℂ                ->  5 <type 'unicode'>     None                           u'\u2115\u2124\u211a\u211d\u2102'
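As you can see, the columns are not aligned correctly for strings that contain non-ASCII characters. This is because the length of those strings in bytes is greater than their number of unicode characters. How can I make print count characters, rather than bytes, when it pads the fields?

When Python 2.7 sees 'ℕℤℚℝℂ', it reads it as "here are 15 arbitrary byte values". It does not know which characters they represent, nor which encoding they are in. You need to decode the byte string into a unicode string, specifying the encoding, before Python can count characters: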
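for data in test_cases:
    try:
        hexlified = binascii.hexlify(data)  # hexlify the original value, before decoding
    except Exception:
        hexlified = None
    if isinstance(data, bytes):  # in Python 2.7, bytes is an alias for str
        data = data.decode('utf-8')  # now len() and padding count characters
    print FORMAT % (data, len(data), type(data), hexlified, repr(data))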
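The decode-then-pad step above can be pulled into small helpers. This is only a minimal sketch: the names `to_text` and `pad_cell` are hypothetical (they are not part of the original answer), and it assumes every byte string really is UTF-8 encoded. It is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: decode byte strings, pass text through unchanged."""
    if isinstance(value, bytes):
        # Assumption: the bytes really are encoded with `encoding`.
        return value.decode(encoding)
    return value


def pad_cell(value, width=20):
    """Hypothetical helper: left-justify to `width` characters, not bytes."""
    # '%-*s' takes the field width from the argument tuple.
    return u'%-*s' % (width, to_text(value))


# Mixed input: two byte strings and one unicode string.
cells = [b'aaaaa', u'ááááá'.encode('utf-8'), u'ℕℤℚℝℂ']
rows = [pad_cell(cell) + u'|' for cell in cells]
```

Every entry in `rows` is 21 characters long, so when the rows are printed to a UTF-8 terminal the `|` marker should land in the same column regardless of how many bytes each cell occupied.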
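Note that in Python 3, unlike Python 2.7, all string literals are unicode objects by default, so they are measured in characters rather than bytes from the outset.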
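As a rough illustration of that difference, the following sketch is meant for Python 3 only. The variable names are purely illustrative, and the format string is a shortened stand-in for the `FORMAT` used in the question.

```python
# Python 3 only: an unprefixed literal is already a text (unicode) string.
text = 'ℕℤℚℝℂ'
raw = text.encode('utf-8')   # explicit conversion to bytes

char_count = len(text)       # counts code points
byte_count = len(raw)        # counts UTF-8 bytes

# Padding a str pads by characters, so the first field is 20 characters wide.
row = '{:<20} -> {:2d}'.format(text, char_count)
```

Here `char_count` is 5 and `byte_count` is 15, which matches the two lengths reported for the same text in the Python 2.7 output above.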