Python: correctly splitting a unicode string by byte count


python unicode utf-8


I want to cut a unicode string down to at most 255 bytes and get the result back as unicode:

# s = arbitrary-length-unicode-string
s.encode('utf-8')[:255].decode('utf-8')

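The problem with this snippet is that if the 255th byte is the first byte of a 2-byte unicode character (or, more generally, if the cut falls inside any multi-byte character), I get an error: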

UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 254: unexpected end of data

Even if I handle the error, I end up with unwanted garbage at the end of the string.
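A minimal reproduction of both symptoms, assuming Python 3 and a made-up input string (254 ASCII characters followed by one Cyrillic letter). The "garbage" is assumed here to be the replacement character left behind by a lenient error handler; the question does not say exactly how the error was handled.

```python
# Hypothetical input: 254 single-byte characters, then one 2-byte character.
s = "a" * 254 + "п"            # "п" (U+043F) encodes to the bytes 0xD0 0xBF

cut = s.encode("utf-8")[:255]  # the slice keeps 0xD0 but drops 0xBF

# Strict decoding fails because the last character is incomplete.
try:
    cut.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" avoids the exception but substitutes U+FFFD for the
# incomplete sequence, leaving a stray character at the end.
replaced = cut.decode("utf-8", errors="replace")
```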

How can I solve this more elegantly?

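One very nice property of UTF-8 is that trailing (continuation) bytes are easy to tell apart from start bytes: every continuation byte has the bit pattern 10xxxxxx, and no start byte does. So just work backwards from the cut until you have removed a start byte.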

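# Take one byte more than the limit so we can see where the cut falls.
# bytearray yields integers when indexed, in both Python 2 and Python 3.
trunc_s = bytearray(s.encode('utf-8')[:256])
if len(trunc_s) > 255:
    final = -1
    # Step back over continuation bytes (10xxxxxx) until a start byte is found.
    while trunc_s[final] & 0xc0 == 0x80:
        final -= 1
    # Drop that start byte and everything after it.
    trunc_s = trunc_s[:final]
trunc_s = trunc_s.decode('utf-8')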
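The same backward scan can be wrapped in a reusable function, and repeated to split a whole string into pieces, which is what the title asks for. This is only a sketch assuming Python 3; the names `truncate_utf8` and `split_utf8` and the `max_bytes` parameter are invented for illustration and are not part of the original answer.

```python
def truncate_utf8(s, max_bytes=255):
    """Return the longest prefix of s whose UTF-8 encoding fits in max_bytes."""
    data = s.encode("utf-8")
    if len(data) <= max_bytes:
        return s
    end = max_bytes
    # data[end] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut is inside a character, so
    # move back to that character's start byte and cut before it.
    while end > 0 and data[end] & 0xC0 == 0x80:
        end -= 1
    return data[:end].decode("utf-8")


def split_utf8(s, max_bytes=255):
    """Yield consecutive pieces of s, each at most max_bytes when encoded."""
    # A single character can need up to 4 bytes; a smaller limit could
    # produce empty pieces and never make progress.
    if max_bytes < 4:
        raise ValueError("max_bytes must be at least 4")
    while s:
        piece = truncate_utf8(s, max_bytes)
        yield piece
        s = s[len(piece):]
```

Calling `list(split_utf8(text))` gives pieces that join back to the original string, each of which encodes to no more than 255 bytes. Re-encoding the remainder on every pass is simple rather than fast; for very long strings, slicing the encoded bytes once would be a reasonable refinement.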

Edit: check the answers on the question this one was confirmed to be a duplicate of.
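The link to the duplicate did not survive on this page, so its answers cannot be quoted here. One commonly used shorter alternative, offered as an assumption rather than as what those answers say, is to let the decoder silently drop the incomplete trailing sequence. The sketch assumes Python 3; `truncate_utf8_ignore` is a hypothetical name.

```python
def truncate_utf8_ignore(s, max_bytes=255):
    """Truncate s to at most max_bytes of UTF-8, dropping a cut-off character."""
    # The bytes come from a valid encode(), so the only undecodable data
    # after slicing is an incomplete character at the very end, which
    # errors="ignore" discards.
    return s.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")
```

This relies on the input being freshly encoded; applied to arbitrary bytes, `errors="ignore"` would also hide genuine corruption elsewhere in the data.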

I've seen an answer to this question before; let me find the duplicate for you.

You're right. Here it is:

@theta: that makes it even easier :-P