Python: Given a list of Unicode code points, how do I split them into a list of Unicode characters?

python python-3.x unicode

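I'm writing a lexical analyzer for Unicode text. Many Unicode characters require multiple code points even after canonical composition. For example, tuple(map(ord, unicodedata.normalize('NFC', 'ā́'))) evaluates to (257, 769). How can I tell where the boundary between two characters lies? Additionally, I'd like to store the unnormalized version of the text. My input is guaranteed to be Unicode.

Here is what I have so far: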

from unicodedata import normalize

def split_into_characters(text):
    character = ""
    characters = []

    for i in range(len(text)):
        character += text[i]

        if len(normalize('NFKC', character)) > 1:
            characters.append(character[:-1])
            character = character[-1]

    if len(character) > 0:
        characters.append(character)

    return characters

print(split_into_characters('Puélla in vī́llā vīcī́nā hábitat.'))

This incorrectly prints the following:

['P', 'u', 'é', 'l', 'l', 'a', ' ', 'i', 'n', ' ', 'v', 'ī', '́', 'l', 'l', 'ā', ' ', 'v', 'ī', 'c', 'ī', '́', 'n', 'ā', ' ', 'h', 'á', 'b', 'i', 't', 'a', 't', '.']

I would expect it to print the following:

['P', 'u', 'é', 'l', 'l', 'a', ' ', 'i', 'n', ' ', 'v', 'ī́', 'l', 'l', 'ā', ' ', 'v', 'ī', 'c', 'ī́', 'n', 'ā', ' ', 'h', 'á', 'b', 'i', 't', 'a', 't', '.']

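The boundaries between perceived characters can be identified with Unicode's grapheme cluster boundary rules (Unicode Standard Annex #29). Python's unicodedata module doesn't carry the data the algorithm needs (the Grapheme_Cluster_Break property), but complete implementations can be found in third-party libraries.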
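For cases like the Latin example in the question, a rough stdlib-only approximation may be enough. The sketch below is not the full grapheme cluster algorithm: it only attaches code points whose general category is a mark (Mn, Mc, Me) to the preceding code point, and the function name split_into_clusters is made up for illustration. It ignores emoji ZWJ sequences, Hangul jamo, regional indicators and the other cases the full rules cover. It does not normalize, so the original text is kept as-is.

```python
import unicodedata

def split_into_clusters(text):
    """Approximate split: glue each combining mark onto the preceding code point.

    This is NOT the full UAX #29 algorithm, only a base-plus-marks heuristic.
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are the combining mark categories.
        is_mark = unicodedata.category(ch).startswith("M")
        if clusters and is_mark:
            clusters[-1] += ch       # extend the current cluster
        else:
            clusters.append(ch)      # start a new cluster
    return clusters

# 'v', then 'i' + U+0304 (macron) + U+0301 (acute), then 'l', 'l', 'a' + U+0304
sample = "vi\u0304\u0301lla\u0304"
print(split_into_clusters(sample))
```

For this sample the call returns five clusters, and joining them gives back the original string unchanged. For anything beyond simple base-plus-marks text, a library that implements the complete rules is the safer choice; as far as I know, the third-party regex module's \X pattern matches one extended grapheme cluster.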

Oh, sweet. Both of them have fairly permissive licenses. Perfect, thanks!