How to correctly tabulate unicode data in Python


python unicode


(I am using Python 2.7.)

I have this test:

# -*- coding: utf-8 -*-

import binascii

test_cases = [
    'aaaaa',    # Normal bytestring
    'ááááá',    # Normal bytestring, but with extended ascii. Since the file is utf-8 encoded, this is utf-8 encoded
    'ℕℤℚℝℂ',    # Encoded unicode. The editor has encoded this, and it is defined as string, so it is left encoded by python
    u'aaaaa',   # unicode object. The string itself is utf-8 encoded, as defined in the "coding" directive at the top of the file
    u'ááááá',   # unicode object. The string itself is utf-8 encoded, as defined in the "coding" directive at the top of the file
    u'ℕℤℚℝℂ',   # unicode object. The string itself is utf-8 encoded, as defined in the "coding" directive at the top of the file
]
FORMAT = '%-20s -> %2d %-20s %-30s %-30s'
for data in test_cases :
    try:
        hexlified = binascii.hexlify(data)
    except:
        hexlified = None
    print FORMAT % (data, len(data), type(data), hexlified, repr(data))

It produces this output:

aaaaa                ->  5 <type 'str'>         6161616161                     'aaaaa'                       
ááááá           -> 10 <type 'str'>         c3a1c3a1c3a1c3a1c3a1           '\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1'
ℕℤℚℝℂ      -> 15 <type 'str'>         e28495e284a4e2849ae2849de28482 '\xe2\x84\x95\xe2\x84\xa4\xe2\x84\x9a\xe2\x84\x9d\xe2\x84\x82'
aaaaa                ->  5 <type 'unicode'>     6161616161                     u'aaaaa'                      
ááááá                ->  5 <type 'unicode'>     None                           u'\xe1\xe1\xe1\xe1\xe1'       
ℕℤℚℝℂ                ->  5 <type 'unicode'>     None                           u'\u2115\u2124\u211a\u211d\u2102'


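As you can see, the columns are not properly aligned for the strings that contain non-ASCII characters. This is because the length of these strings in bytes is greater than their number of unicode characters. How can I make print pad the fields according to the number of characters rather than the number of bytes?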
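The mismatch between byte length and character count can be reproduced in isolation. The following is a minimal sketch that should run on both Python 2.7 and Python 3; the variable names are illustrative and not part of the original test.

```python
# -*- coding: utf-8 -*-

text = u'ℕℤℚℝℂ'               # five characters (code points)
raw = text.encode('utf-8')    # the same text as a UTF-8 byte string

print(len(text))   # number of characters
print(len(raw))    # number of bytes; each of these characters takes 3 bytes

# Padding to a width of 20 counts bytes for a byte string,
# so the byte string receives less padding than the unicode object.
padded_text = text.ljust(20)
padded_raw = raw.ljust(20)
print(len(padded_text) - len(text))   # padding added to the unicode object
print(len(padded_raw) - len(raw))     # padding added to the byte string
```

The byte string receives ten fewer padding characters, which is the shift visible in the misaligned rows of the output.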

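When Python 2.7 sees 'ℕℤℚℝℂ', it reads it as "here are 15 arbitrary byte values". It does not know which characters they represent, nor which encoding they are in. You need to decode this byte string into a unicode string, specifying the encoding, before Python can count the characters: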
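# Four fields only: the hexlified column from the question is left out here
FORMAT = '%-20s -> %2d %-20s %-30s'
for data in test_cases:
    if isinstance(data, bytes):
        # In Python 2, bytes is an alias of str; decode it to a unicode object
        data = data.decode('utf-8')
    print FORMAT % (data, len(data), type(data), repr(data))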
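The same idea can be wrapped in small helpers so that every cell is decoded before it is padded. This is only a sketch under the assumption that all byte strings are UTF-8 encoded; `to_text` and `format_row` are hypothetical names, not part of the original answer. It is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding='utf-8'):
    """Return value as a text (unicode) object, decoding byte strings.

    Assumption: byte strings are encoded with `encoding` (UTF-8 by default).
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def format_row(value, width=20):
    """Left-justify value to `width` characters and append its length."""
    text = to_text(value)
    return u'%s -> %2d' % (text.ljust(width), len(text))


# A mix of byte strings and unicode objects, as in the test cases above.
rows = [u'aaaaa'.encode('utf-8'), u'ááááá'.encode('utf-8'), u'ℕℤℚℝℂ']
lines = [format_row(row) for row in rows]

# Every formatted line has the same number of characters, so columns line up.
print([len(line) for line in lines])
```

Printing each entry of `lines` then gives aligned columns, provided the terminal can display the characters and renders each one in a single cell.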

Note that in Python 3, unlike Python 2, all string literals are unicode objects by default.

Work with characters in the first place, not bytes.
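In Python 3, plain string literals are already text, so the padding counts characters without any explicit decoding. The sketch below assumes Python 3 and uses `str.format` in place of the `%` operator; the three-field `FORMAT` is an illustrative simplification of the one in the question.

```python
# Python 3 only: plain literals are text (str), not bytes.
FORMAT = '{:<20} -> {:2d} {}'

lines = []
for data in ['aaaaa', 'ááááá', 'ℕℤℚℝℂ']:
    # len(data) counts characters, and '{:<20}' pads by characters as well
    lines.append(FORMAT.format(data, len(data), type(data).__name__))

for line in lines:
    print(line)
```

Byte strings read from files or sockets still need to be decoded first, as in the Python 2 answer.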