Python throws UnicodeEncodeError although I am doing str.decode(). Why?

Consider this function:

import htmlentitydefs

def escape(text):
    print repr(text)
    escaped_chars = []
    for c in text:
        try:
            c = c.decode('ascii')
        except UnicodeDecodeError:
            c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
        escaped_chars.append(c)
    return ''.join(escaped_chars)

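It is supposed to escape all non-ASCII characters with the corresponding htmlentitydefs. Unfortunately, Python throws

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

when the variable text contains the string whose repr() is u'Tam\xe1s Horv\xe1th'.

But I don't use str.encode(). I only use str.decode(). Am I missing something?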

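You are passing in a string that is already unicode. So before Python can call decode on it, it has to encode it first, and by default it does so with the ASCII codec. That implicit encode step is what raises the UnicodeEncodeError.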
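The hidden step can be spelled out as a small sketch. The helper name `implicit_decode` and its `default_encoding` parameter are hypothetical and only illustrate what Python 2 roughly does when decode is called on a unicode object; the snippet is written so that it also runs on Python 3.

```python
text = u'Tam\xe1s Horv\xe1th'

def implicit_decode(unicode_text, codec='ascii', default_encoding='ascii'):
    # Hypothetical helper: mimics what Python 2 does for unicode.decode().
    # Step 1 (hidden in Python 2): turn the unicode object into bytes
    # using the default encoding, which is ASCII unless changed.
    raw_bytes = unicode_text.encode(default_encoding)
    # Step 2: the decode the caller actually asked for.
    return raw_bytes.decode(codec)

error_name = None
try:
    implicit_decode(text)
except UnicodeEncodeError as exc:
    # The failure comes from step 1, so it is an *encode* error.
    error_name = type(exc).__name__

print(error_name)
```

The exception is raised before any decoding happens, which is why the traceback mentions encoding even though the calling code only says decode.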

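Edited to add: it depends on what you want to do. If you simply want to convert a unicode string with non-ASCII characters into an HTML-encoded representation, you can do that in a single call:

text.encode('ascii', 'xmlcharrefreplace')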
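As a quick illustration of that call, here is a minimal sketch using the string from the question. Note that the `xmlcharrefreplace` error handler emits numeric character references rather than the named entities from htmlentitydefs.

```python
text = u'Tam\xe1s Horv\xe1th'

# Characters that ASCII cannot represent are replaced by &#NNN; references.
escaped = text.encode('ascii', 'xmlcharrefreplace')

print(escaped)
```

The result is a byte string, so it can be written to a file or socket without further encoding.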

Calling decode on a str makes no sense.

I suppose you could check ord(c) > 127 instead.

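Python has two types of strings: character strings (the unicode type) and byte strings (the str type). The code you pasted operates on byte strings. You need a similar function to handle character strings.

Maybe something like this: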

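import htmlentitydefs

def uescape(text):
    print repr(text)
    escaped_chars = []
    for c in text:
        # Escape everything outside printable ASCII.
        if (ord(c) < 32) or (ord(c) > 126):
            # Raises KeyError for code points without a named entity (e.g. u'\n').
            c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
        escaped_chars.append(c)
    return ''.join(escaped_chars)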
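A possible variant, sketched under a few assumptions: it only escapes code points above 127 (the ord(c) > 127 check suggested in the comments) and falls back to a numeric reference when no named entity exists. The function name `escape_non_ascii` is hypothetical, and the import is written so that the snippet runs on both Python 2 and Python 3.

```python
try:
    from htmlentitydefs import codepoint2name   # Python 2
except ImportError:
    from html.entities import codepoint2name    # Python 3

def escape_non_ascii(text):
    """Hypothetical variant of uescape that never raises KeyError."""
    escaped_chars = []
    for c in text:
        code = ord(c)
        if code > 127:
            name = codepoint2name.get(code)
            # Prefer the named entity; otherwise use a numeric reference.
            c = u'&{};'.format(name) if name else u'&#{};'.format(code)
        escaped_chars.append(c)
    return u''.join(escaped_chars)

result = escape_non_ascii(u'Tam\xe1s Horv\xe1th \u0151')
print(result)
```

Unlike the version above, this leaves control characters such as newlines untouched, which may or may not be what you want.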


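I don't know whether either of these functions is really necessary for you. If it were me, I would choose UTF-8 as the character encoding of the resulting document, process the document as character strings (without worrying about entities), and perform

content.encode('UTF-8')

as the final step before delivering the document to the client. Depending on the web framework you choose, you may even be able to pass character strings directly to its API and let it work out how to set the encoding.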
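A minimal sketch of that workflow, assuming the page is built entirely from unicode strings. The names `content`, `body` and `headers` are illustrative only and do not belong to any particular web framework.

```python
# Work with character strings everywhere while building the page.
content = u'<p>Tam\xe1s Horv\xe1th</p>'

# Encode exactly once, as the last step before sending.
body = content.encode('UTF-8')

# Hypothetical response header telling the client which encoding was used.
headers = {'Content-Type': 'text/html; charset=utf-8'}

print(repr(body))
```

With the charset declared, the browser decodes the bytes itself and no entity escaping of non-ASCII characters is needed.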

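This is a misleading error report that comes from the way Python handles the de/encoding process. You tried to decode an already decoded string a second time, which confuses the Python function, which in turn confuses you. As far as I know, the encoding and decoding is carried out by the codecs module, and that is where the misleading exception message originates.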

You can check it yourself. Either

u'\x80'.encode('ascii')

or

u'\x80'.decode('ascii')

will throw a UnicodeEncodeError, whereas

u'\x80'.encode('utf8')

will not, but

u'\x80'.decode('utf8')

will again.
I guess you are confused about the meaning of encode and decode. To put it simply:

                    decode              encode
ByteString (ascii) --------> UNICODE ---------> ByteString (utf8)
                    codec               codec
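But why does the decode method take a codec argument? Well, the underlying function cannot guess which codec the ByteString was encoded with, so it takes the codec as an argument. If none is provided, it assumes you mean the implicitly used sys.getdefaultencoding().

So when you use c.decode('ascii'), you a) have an (encoded) ByteString (that is why you use decode), b) want to get a unicode object out of it (that is what you use decode for), and c) the codec with which the ByteString was encoded is ascii.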
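The direction of each call in the diagram can be sketched as follows. The input bytes are an assumption for illustration: they are taken to be Latin-1 encoded so that a non-ASCII character is present.

```python
raw = b'Tam\xe1s'                  # assumed input: Latin-1 encoded bytes

text = raw.decode('latin-1')       # bytes -> unicode, codec names the source encoding
utf8_bytes = text.encode('utf8')   # unicode -> bytes, codec names the target encoding

# Naming the wrong source codec fails in the decode direction.
try:
    raw.decode('ascii')
    wrong_codec_error = None
except UnicodeDecodeError as exc:
    wrong_codec_error = type(exc).__name__

print(repr(text), repr(utf8_bytes), wrong_codec_error)
```

Here decode is applied to bytes, so the failure is a UnicodeDecodeError, in contrast to the UnicodeEncodeError from the question, where decode was applied to a unicode object.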
Whenever I run into this problem, this answer always works for me:

def byteify(input):
    '''
    Removes unicode encodings from the given input string.
    '''
    if isinstance(input, dict):
        return {byteify(key): byteify(value) for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input
Or is my approach to escaping characters pointless? Thank you very much for the detailed explanation.
import sys

reload(sys)  # restores sys.setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding("latin-1")
a = u'\xe1'
print str(a)  # no exception