Combined diacritics are not normalized by unicodedata.normalize (Python)

I know that unicodedata.normalize can be used to convert characters with diacritics to their non-diacritic counterparts:

import unicodedata

# Decompose to NFD, then drop the combining marks (category 'Mn').
''.join(c for c in unicodedata.normalize('NFD', u'B\u0153uf')
        if unicodedata.category(c) != 'Mn')

My question (as this example shows) is: does unicodedata have a way to replace combined-character diacritics with their counterparts? (u'œ' becomes 'oe')

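If not, I assume I will have to handle these by hand, but then I might as well compile my own dict of all such characters and their counterparts and forget about unicodedata altogether.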

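The terminology in your question is a little confused. A diacritic is a mark that can be added to a letter or other character but generally does not stand on its own. (Unicode also uses the more general term "combining character".) What normalize('NFD', ...) does is convert precomposed characters into their components.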
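To make the decomposition step concrete, here is a small sketch (assuming Python 3) of what NFD does to a precomposed character. The choice of 'é' is only an illustration and is not taken from the question.

```python
import unicodedata

# U+00E9 is a precomposed character: 'e' with an acute accent in one code point.
precomposed = '\u00e9'

# NFD splits it into a base letter followed by a combining mark.
decomposed = unicodedata.normalize('NFD', precomposed)
names = [unicodedata.name(c) for c in decomposed]
print(names)

# The combining mark has category 'Mn', which is what the filter in the
# question removes.
stripped = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
print(stripped)
```

This is why the snippet in the question works for accented letters: the accent becomes a separate 'Mn' character that can be filtered out.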

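Anyway, the answer is that œ is not a precomposed character. It is a ligature, as its Unicode name, LATIN SMALL LIGATURE OE, shows.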
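A quick way to check this yourself (a sketch, assuming Python 3) is to look at the character's name and its decomposition mapping in unicodedata. For œ the mapping is empty, so normalization has nothing to split.

```python
import unicodedata

oe = '\u0153'

# The Unicode name identifies the character as a ligature.
oe_name = unicodedata.name(oe)
print(oe_name)

# An empty decomposition mapping means there are no components to split into.
oe_decomposition = unicodedata.decomposition(oe)
print(repr(oe_decomposition))

# Consequently NFD returns the character unchanged.
oe_nfd = unicodedata.normalize('NFD', oe)
print(oe_nfd == oe)
```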

The unicodedata module provides no method for splitting ligatures into their parts. But the data is there in the character names:

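import re
import unicodedata

# Matches names such as 'LATIN SMALL LIGATURE OE'; group 1 is set for capitals.
# The trailing $ skips multi-word names such as 'LATIN SMALL LIGATURE LONG S T'.
_ligature_re = re.compile(r'LATIN (?:(CAPITAL)|SMALL) LIGATURE ([A-Z]{2,})$')

def split_ligatures(s):
    """
    Split the ligatures in `s` into their component letters.
    """
    def untie(l):
        # Pass a default: unicodedata.name() raises ValueError for
        # characters that have no name (e.g. control characters).
        m = _ligature_re.match(unicodedata.name(l, ''))
        if not m: return l
        elif m.group(1): return m.group(2)
        else: return m.group(2).lower()
    return ''.join(untie(l) for l in s)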

>>> split_ligatures(u'B\u0153uf \u0132sselmeer \uFB00otogra\uFB00')
u'Boeuf IJsselmeer ffotograff'

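(Of course, you wouldn't do it this way in practice: you would preprocess the Unicode database to generate a lookup table, as you suggest in your question. There aren't all that many ligatures in Unicode.)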
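One possible shape for such a lookup table is sketched below, assuming Python 3. The names build_ligature_table, LIGATURE_TABLE and split_ligatures_fast are made up for this illustration; the table is built once by scanning every code point and is then applied with str.translate.

```python
import re
import sys
import unicodedata

# Same naming convention as above; the trailing $ keeps multi-word names
# such as 'LATIN SMALL LIGATURE LONG S T' out of the table.
_LIGATURE_NAME = re.compile(r'LATIN (CAPITAL|SMALL) LIGATURE ([A-Z]{2,})$')

def build_ligature_table():
    """Map each Latin ligature code point to its component letters."""
    table = {}
    for cp in range(sys.maxunicode + 1):
        # The '' default avoids ValueError for unnamed code points.
        m = _LIGATURE_NAME.match(unicodedata.name(chr(cp), ''))
        if m:
            letters = m.group(2)
            table[cp] = letters if m.group(1) == 'CAPITAL' else letters.lower()
    return table

# Built once at import time (hypothetical module-level constant).
LIGATURE_TABLE = build_ligature_table()

def split_ligatures_fast(s):
    """Replace ligatures in `s` using the precomputed table."""
    return s.translate(LIGATURE_TABLE)

result = split_ligatures_fast('B\u0153uf \u0132sselmeer \uFB00otogra\uFB00')
print(result)
```

The scan over all code points happens only once, so each later call is a single str.translate pass. The table could equally be generated offline and pasted in as a literal dict.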

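Warning: the approach based on Unicode names containing "LIGATURE" is not reliable. Some ligatures do not have "LIGATURE" in their name. For example, unicodedata.name(u'\xc6') -> 'LATIN CAPITAL LETTER AE'. There is also ß (U+00DF), which is named 'LATIN SMALL LETTER SHARP S' but can be considered a double-s ligature.

@Scott: Would you like me to delete this answer?

@GarethRees: Keep your answer, it's useful. By my count, unicodedata has more than 500 code points with LIGATURE in their names, although many of them are for other scripts. I only mentioned my warning to let people know that there are some corner cases.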
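The corner cases from the warning can be checked directly, and one way to cover them is a small hand-written supplement. This is only a sketch, assuming Python 3; EXTRA_EXPANSIONS and expand_extra are hypothetical names, and the three entries are just the characters mentioned here, not a complete list.

```python
import unicodedata

# Neither name contains the word LIGATURE, so a name-based match skips them.
ae_name = unicodedata.name('\u00c6')
sharp_s_name = unicodedata.name('\u00df')
print(ae_name)
print(sharp_s_name)

# Hypothetical hand-maintained supplement for such characters.
EXTRA_EXPANSIONS = {
    '\u00c6': 'AE',  # Æ
    '\u00e6': 'ae',  # æ
    '\u00df': 'ss',  # ß
}
_EXTRA_TABLE = str.maketrans(EXTRA_EXPANSIONS)

def expand_extra(s):
    """Expand the hand-listed characters in `s`."""
    return s.translate(_EXTRA_TABLE)

print(expand_extra('Stra\u00dfe'))
```

Whether ß should become 'ss' depends on the application, so treat these mappings as a starting point.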
