Python 3.x: encoding and decoding special characters (Latin-1)


I am trying to clean up some strange Unicode characters after parsing HTML, but they are still not being converted.

Original text:

raw = 'If further information is needed, donÂ´t hesitate to contact us. Kind regards, JosÃ© Ramirez.'
After encoding and decoding:

text = str(raw.encode().decode('unicode_escape'))
Current output:

'If further information is needed, donÃ\x82Â´t hesitate to contact us. Kind regards, JosÃ\x83Â© Ramirez'
Expected output:

'If further information is needed, don´t hesitate to contact us. Kind regards, José Ramirez'

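You are doing it the wrong way around. For a string that contains no backslashes,
raw.encode().decode('unicode_escape')
has the same effect as
raw.encode('utf-8').decode('latin-1')
so it garbles the text a second time instead of repairing it. What you actually want is: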

>>> raw.encode('latin-1').decode('utf-8')
'If further information is needed, don´t hesitate to contact us. Kind regards, José Ramirez.'
Your string came from someone who received UTF-8 encoded text but assumed it was Latin-1.

If you have to deal with many different variants of mojibake (text garbled by decoding it with the wrong encoding), the ftfy package can help:

>>> import ftfy
>>> ftfy.fix_text('If further information is needed, donÂ´t hesitate to contact us. Kind regards, JosÃ© Ramirez.')
'If further information is needed, don´t hesitate to contact us. Kind regards, José Ramirez.'
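
If adding a dependency is not an option, the single Latin-1/UTF-8 mix-up described above can be handled with the standard library alone. The following is only a sketch: fix_latin1_mojibake is a hypothetical helper name, not part of ftfy or Python, and it covers just this one kind of mojibake.

```python
def fix_latin1_mojibake(text: str) -> str:
    """Undo one round of 'UTF-8 bytes decoded as Latin-1', if possible.

    Hypothetical helper for illustration only; it is not part of ftfy
    or the standard library.
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text contains characters outside Latin-1, or the
        # recovered bytes are not valid UTF-8. In both cases the text was
        # probably not this kind of mojibake, so return it unchanged.
        return text


print(fix_latin1_mojibake('JosÃ© Ramirez'))  # José Ramirez
print(fix_latin1_mojibake('José Ramirez'))   # José Ramirez (left unchanged)
```

Returning the input unchanged on failure keeps the helper safe to apply to text that is already correct. It will not repair other mix-ups, such as text that was misread as Windows-1252.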