Decoding both UTF-8 and unicode escapes in Python 2.7 causes UnicodeEncodeError

python text unicode encoding compiler-errors

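I have a TSV file in which, on some rows, a particular column contains a mixed format, for example:

Hapoel_Be\u0027er_Sheva_A\u002eF\u002eC\u002e

which should be:

Hapoel_Be'er_Sheva_A.F.C.

Here is the code I use to read the file and split the columns: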

with open(path, 'rb') as f:
    for line in f:
        cols = line.decode('utf-8').split('\t')
        text = cols[3].decode('unicode-escape')  # Here is the column that has the above-mentioned mixed format

Error message:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u0160' in position 6: ordinal not in range(128)

How can I convert the first, mixed format into the second one while reading the file? I am using Python 2.7.

Many thanks.

You can use ast.literal_eval to convert the raw bytes to unicode:

import ast

raw_bytes = br'Hapoel_Be\u0027er_Sheva_A\u002eF\u002eC\u002e'
print(raw_bytes)  # b'Hapoel_Be\u0027er_Sheva_A\u002eF\u002eC\u002e'

unicode_string = ast.literal_eval('"{}"'.format(raw_bytes.decode('utf8')))

Output of unicode_string:

Hapoel_Be'er_Sheva_A.F.C.

Update: tested in Python 2.7 and it works like a charm.

You can use decode('unicode-escape') to convert these hex sequences to characters:

>>> 'Hapoel_Be\\u0027er_Sheva_A\\u002eF\\u002eC\\u002e'.decode('unicode-escape')
u"Hapoel_Be'er_Sheva_A.F.C."
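One caveat is worth checking before applying this codec to the raw file bytes: as far as I know, unicode-escape reads every non-escape byte as Latin-1, so UTF-8 multi-byte characters do not survive it. A minimal sketch, with sample bytes made up for illustration:

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes for U+0160 followed by an escaped apostrophe
# (illustrative input, not taken from the asker's file).
raw = b'\xc5\xa0\\u0027'

decoded = raw.decode('unicode-escape')

# The escape is resolved, but the two UTF-8 bytes come back as two
# separate Latin-1 characters instead of a single U+0160.
print(repr(decoded))
```

This suggests why decoding the whole line with unicode-escape alone is not a drop-in fix when the file also contains real UTF-8 text.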

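Edit: based on your update to the question, you actually have a combination of hex sequences and Unicode characters outside the ASCII range. The error comes from the automatic conversion that Python 2.7 attempts when you call .decode() on a Unicode string. decode is only valid for byte strings, so Python first tries to convert from Unicode using the ASCII codec. Python 3 does not allow this mistake.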
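The implicit step can be made visible by doing it by hand. The sketch below only performs the ASCII encode that Python 2 would attempt behind the scenes; the sample string is an assumption chosen to include U+0160, the character named in the error message:

```python
# -*- coding: utf-8 -*-
# Already a unicode string: literal backslash-u text plus a real non-ASCII character.
text = u'Be\\u0027er \u0160'

# Python 2 effectively runs text.encode('ascii') before it can decode.
# Performing that step explicitly raises the same kind of error.
try:
    text.encode('ascii')
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc)
```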

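To fix this you need a double conversion: one that turns the non-ASCII characters into hex sequences, and another that turns everything back. The 'unicode-escape' codec doubles the backslashes when encoding, so those must be corrected as well: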

>>> print u'Hapoel_Be\\u0027er_Sheva_A\\u002eF\\u002eC\\u002e\u0160'.encode('unicode-escape').replace(b'\\\\u', b'\\u').decode('unicode-escape')
Hapoel_Be'er_Sheva_A.F.C.Š
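An alternative that avoids the encode/replace round trip is to substitute only the literal \uXXXX sequences with a regular expression. This is a sketch rather than part of the answer above; `unescape_u` is a hypothetical helper name, and it assumes the column only contains four-digit \u escapes:

```python
# -*- coding: utf-8 -*-
import re

try:
    unichr          # exists on Python 2
except NameError:
    unichr = chr    # Python 3: chr already returns a unicode character

# Matches a literal backslash, "u", and exactly four hex digits.
_U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def unescape_u(text):
    """Replace literal backslash-u escapes in an already-decoded string.

    Hypothetical helper; other characters are left untouched.
    """
    return _U_ESCAPE.sub(lambda m: unichr(int(m.group(1), 16)), text)


mixed = u'Hapoel_Be\\u0027er_Sheva_A\\u002eF\\u002eC\\u002e\u0160'
result = unescape_u(mixed)
```

In the loop from the question, `unescape_u(cols[3])` would take the place of the `.decode('unicode-escape')` call. Escapes written as surrogate pairs are not combined by this sketch.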

Comments:

Is this Python 2 or Python 3?

@FHTMitchell Sorry, I forgot to specify. It is Python 2.7.

Damn, this is better than mine.

@FHTMitchell I always consider any kind of eval a last resort. One of Python's guiding principles is that there should always be one obvious way to do something, but it seriously violates that principle here.

@MarkRansom It results in an error. I will post more details and the error message.

Thanks for your effort, but it raises an error. I guess that is because I have already decoded the whole line (question edited accordingly).