Any gotchas using unicode_literals in Python 2.6?


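We have already gotten our code base running under Python 2.6. To prepare for Python 3.0, we have started adding:

from __future__ import unicode_literals

to our .py files (as we modify them). I'm wondering whether anyone else has been doing this and has run into any non-obvious gotchas (perhaps after spending a lot of time debugging).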

I did find that if you add the unicode_literals directive, you should also add something like:

 # -*- coding: utf-8

to the first or second line of your .py file. Otherwise, lines such as:

 foo = "barré"

result in an error such as:

SyntaxError: Non-ASCII character '\xc3' in file mumble.py on line 198, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
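To see which comment lines count as an encoding declaration, here is a minimal sketch that applies the pattern given in PEP 263 to the first two lines of a file. The helper name declared_encoding is hypothetical, and the sketch only mimics the lookup; it is not the interpreter's own code.

```python
import re

# Pattern as given in PEP 263: a comment containing "coding:" or "coding=".
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source_lines):
    """Return the encoding named in the first two lines, or None.

    Hypothetical helper for illustration only.
    """
    for line in source_lines[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None


print(declared_encoding(["# -*- coding: utf-8"]))
print(declared_encoding(["#!/usr/bin/env python", "# encoding: utf-8"]))
print(declared_encoding(["foo = 1"]))
```

Because the pattern only looks for "coding" followed by ":" or "=", prefixes such as "en" or "file" and the surrounding -*- markers do not matter to Python.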

The main source of problems I've had working with unicode strings is mixing utf-8 encoded strings with unicode ones.

For example, consider the following scripts.

two.py

# encoding: utf-8
name = 'helló wörld from two'

one.py

# encoding: utf-8
from __future__ import unicode_literals
import two
name = 'helló wörld from one'
print name + two.name

Running python one.py produces:

Traceback (most recent call last):
  File "one.py", line 5, in <module>
    print name + two.name
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

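Here two.name is a utf-8 encoded byte string, because two.py does not import unicode_literals, while name in one.py is a unicode string. When the two are concatenated, Python 2 tries to decode the byte string as ASCII in order to convert it to unicode, and fails. Decoding explicitly, as in name + two.name.decode('utf-8'), works.

The same thing can happen when you encode a string and mix it with other strings later. Without unicode_literals, a script (test.py) whose line 6 is print 'DEBUG: %s' % html, where html is a utf-8 encoded string, outputs:

DEBUG: <html><body>helló wörld</body></html>

After adding the unicode_literals import to the same script, the output is:

Traceback (most recent call last):
  File "test.py", line 6, in <module>
    print 'DEBUG: %s' % html
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 16: ordinal not in range(128)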


It fails because 'DEBUG: %s' is a unicode string, so Python tries to decode html as ASCII. Two ways to fix the print are:

print str('DEBUG: %s') % html

or

print 'DEBUG: %s' % html.decode('utf-8')

I hope this helps you understand the potential gotchas when using unicode strings.
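The failure and the decode() fix can be reproduced on any Python version by doing explicitly what Python 2 does implicitly. This is a minimal sketch: the variable names and the helper implicit_decode_error are made up for illustration, and the u'' prefixes stand in for what unicode_literals would do.

```python
# Python 2 converts a byte string to unicode by decoding it as ASCII.
# Doing that decode by hand triggers the same error.
encoded_name = u"hell\u00f3 w\u00f6rld from two".encode("utf-8")  # bytes, like two.name
name = u"hell\u00f3 w\u00f6rld from one"                          # text, like one.name


def implicit_decode_error(raw):
    """Return the UnicodeDecodeError raised by an ASCII decode, or None."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError as exc:
        return exc
    return None


error = implicit_decode_error(encoded_name)
print(error)

# The fix: decode with the encoding the bytes were really written in.
combined = name + encoded_name.decode("utf-8")
print(repr(combined))
```

The error's start position is the index of the first non-ASCII byte, which is why the tracebacks above report position 4 for the bare name and position 16 once the html markup precedes it.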

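Also, in 2.6 (before Python 2.6.5 RC1+), unicode literals do not play well with keyword arguments.

For example, the following code works without unicode_literals, but fails with TypeError: keywords must be string if unicode_literals is used: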

  >>> def foo(a=None): pass
  ...
  >>> foo(**{'a':1})
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: foo() keywords must be strings
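On an interpreter that still has this limitation, one possible workaround is to convert the dictionary keys to native str objects before unpacking. This is only a sketch under that assumption: with_str_keys is a hypothetical helper, and it only suits ASCII keyword names.

```python
def foo(a=None):
    return a


def with_str_keys(kwargs):
    """Return a copy of kwargs whose keys are native str objects.

    Hypothetical helper; str() on a unicode key only works for ASCII names.
    """
    return dict((str(key), value) for key, value in kwargs.items())


# u"a" stands in for the key that unicode_literals would produce.
result = foo(**with_str_keys({u"a": 1}))
print(result)
```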


Also take into account that unicode_literals affects eval() but not repr() (an asymmetric behavior which, imho, is a bug): eval(repr(b'\xa4')) will not be equal to b'\xa4' (as it would be in Python 3).

Ideally, the following code would be an invariant that always holds, for every combination of unicode_literals and Python {2.7, 3.x} usage:

from __future__ import unicode_literals

bstr = b'\xa4'
assert eval(repr(bstr)) == bstr # fails in Python 2.7, holds in 3.1+

ustr = '\xa4'
assert eval(repr(ustr)) == ustr # holds in Python 2.7 and 3.1+

The second assertion happens to work because, in Python 2.7, repr('\xa4') evaluates to u'\xa4'.
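The invariant can be wrapped in a small check that also compares types, which is where the Python 2.7 case goes wrong. This is a sketch: roundtrips is a hypothetical helper. On Python 3 both calls return True; the Python 2.7 behavior with unicode_literals is as described in this answer and is not re-verified here.

```python
def roundtrips(value):
    """True if eval(repr(value)) rebuilds an equal object of the same type."""
    rebuilt = eval(repr(value))
    return type(rebuilt) is type(value) and rebuilt == value


print(roundtrips(b"\xa4"))
print(roundtrips(u"\xa4"))
```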

There are more.

Some libraries and builtins expect strings and do not tolerate unicode.

Two examples:

builtin:

myenum = type('Enum', (), enum)

(slightly esoteric) doesn't work with unicode_literals: type() expects a string.

library:

from wx.lib.pubsub import pub
pub.sendMessage("LOG MESSAGE", msg="no go for unicode literals")

doesn't work: the wx pubsub library expects a string message type.

The former is esoteric and easily fixed with

myenum = type(b'Enum', (), enum)

but the latter is devastating if your code is full of calls to pub.sendMessage() (which mine is).

Dang it, eh?!?
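Note that the b'Enum' fix is specific to Python 2: on Python 3, type() requires a text str as the class name and rejects bytes. Code that has to run on both can convert to whatever the interpreter's native str is. This is a minimal sketch; native_str is a hypothetical helper and the enum contents are made up.

```python
import sys


def native_str(text, encoding="ascii"):
    """Return text as the interpreter's native str type.

    Hypothetical helper: bytes on Python 2, unchanged text on Python 3.
    """
    if sys.version_info[0] == 2 and not isinstance(text, str):
        return text.encode(encoding)
    return text


enum = {"RED": 1, "GREEN": 2}  # placeholder attributes for the example

# u"Enum" stands in for the literal that unicode_literals would produce.
MyEnum = type(native_str(u"Enum"), (), enum)
print(MyEnum.__name__)
```

The same conversion could be applied to message types passed to a library that insists on native strings, at the cost of wrapping every call.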

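Click will raise unicode exceptions if any module that has from __future__ import unicode_literals is imported where you use click.echo. It's a nightmare…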
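FYI, Python 2.6.5 RC1+ has fixed this.

I'd suggest going with the decode() solution instead of the str() or encode() solutions: the more often you use Unicode objects, the clearer the code is, since what you want is to manipulate strings of characters, not arrays of bytes with an externally implied encoding.

Please fix your terminology. "When you mix utf-8 encoded strings with unicode ones": UTF-8 and Unicode are not two different encodings; Unicode is a standard and UTF-8 is one of the encodings it defines.

@Kos: I think he means mixing "UTF-8 encoded string" objects with unicode (hence decoded) objects. The former is of type str, the latter is of type unicode. Being different objects, problems may arise if you try to sum, concatenate, or interpolate them.

Does this apply to python>=2.6 or python==2.6?

@IanMackinnon: Python 3 assumes that files are UTF-8 by default.

@endolith: But Python 2 does not, and it will give a syntax error if you use non-ASCII characters even in comments! So, IMHO, # -*- coding: utf-8 is a virtually mandatory statement regardless of whether you use unicode_literals or not.

The -*- is not required; if you were going for the Emacs-compatible way, I think you'd need -*- coding: utf-8 -*- (see also the -*- at the end). All you need is coding: utf-8 (or even = instead of :).

You get this error whether or not you do from __future__ import unicode_literals.

Emacs compatibility requires # -*- coding: utf-8 -*- with "coding" (not "encoding" or "fileencoding" or anything else; Python just looks for "coding" irrespective of any prefix).

The type stuff also leaks into metaclasses, so in Django any strings you declare in class Meta: should be b'field_name'.

Yeah... in my case I realized it was worth the effort to search and replace all sendMessage strings with b'' versions. If you want to avoid the dreaded "decode" exception, there is nothing like strictly using unicode in your program and converting on input and output as necessary (the "unicode sandwich" mentioned in some articles I have read on the topic). Overall, unicode_literals has been a big win for me...

I feel the bigger problem here is that you are using repr to regenerate an object. The repr documentation clearly states that this is not a requirement. In my opinion, this relegates repr to something useful only for debugging.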
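The "unicode sandwich" mentioned in the comments can be sketched as three small steps: decode bytes where they enter the program, work only with unicode text inside it, and encode again on the way out. The function names below are hypothetical and utf-8 is assumed as the external encoding.

```python
def decode_input(raw, encoding="utf-8"):
    """Bottom slice: turn incoming bytes into text as early as possible."""
    return raw.decode(encoding)


def render(body):
    """Filling: all internal processing happens on text only."""
    return u"<html><body>%s</body></html>" % body


def encode_output(text, encoding="utf-8"):
    """Top slice: turn text back into bytes only when it leaves the program."""
    return text.encode(encoding)


raw_body = b"hell\xc3\xb3 w\xc3\xb6rld"  # utf-8 bytes, e.g. read from a file
page = render(decode_input(raw_body))
wire_bytes = encode_output(page)
print(repr(wire_bytes))
```

Since byte strings and text never meet in the middle layer, the implicit ASCII decode that causes the UnicodeDecodeError in the examples above has no chance to run.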