Python 2.7: Unicode characters from Python I/O output to a file


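I don't know whether I'm misunderstanding UTF-8 or Python, but I'm having trouble understanding how Python writes Unicode characters to a file. In case it makes a difference, I'm on a Mac running OS X.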

Say I have the following unicode string:

foo = u'\x93Stuff in smartquotes\x94\n'

Here \x93 and \x94 are those awful smart quotes.

Then I write it to a file:

with open('file.txt', 'w') as file:
    file.write(foo.encode('utf8'))

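When I open file.txt in a text editor (such as TextWrangler) or a web browser, it appears to have been written as

\xc2\x93Stuff in smartquotes\xc2\x94\n

The text editor correctly recognizes the file as UTF-8 encoded, but it renders \xc2\x93 as garbage. If I go in and manually strip out the \xc2 part, I get what I expect, and TextWrangler and Firefox render the characters as smart quotes.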

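This is exactly what I get when I read the file back into Python without decoding it as 'utf8'. However, when I read it in with

read().decode('utf8')

I get back what I originally put in, without the \xc2 bytes.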

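This is driving me crazy, because I'm trying to parse a bunch of HTML files into text, and the incorrect rendering of these Unicode characters is breaking a lot of things.

I also tried it in Python 3 using the normal read/write methods, and it behaves the same way.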

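Edit: Regarding manually stripping out the \xc2: when I do that, the file displays correctly because the browser and text editor then default to a Latin encoding.

Also, as a follow-up, Firefox renders the text as

☐Stuff in smartquotes☐

where the boxes are empty Unicode values, while Chrome renders the text as

Stuff in smartquotes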

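The problem is that u'\x93' and u'\x94' are not the Unicode code points for smart quotes. They are the smart quotes in the Windows-1252 encoding, which is not the same as the latin1 encoding. In latin1, those byte values are not assigned to printable characters; they map to the C1 control characters U+0093 and U+0094, which have no Unicode name: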

>>> import unicodedata as ud
>>> ud.name(u'\x93')
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
ValueError: no such name
>>> ud.name(u'\x94')
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
ValueError: no such name
>>> ud.name(u'\u201c')
'LEFT DOUBLE QUOTATION MARK'
>>> ud.name(u'\u201d')
'RIGHT DOUBLE QUOTATION MARK'
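To make the distinction concrete, here is a minimal sketch (variable names are illustrative, and it is written to run on both Python 2.7 and 3) that checks the Unicode category of U+0093, decodes the single byte 0x93 under both encodings, and shows where the \xc2 in the file comes from.

```python
import unicodedata

# U+0093 is a C1 control character (category 'Cc'), not a quotation mark.
c1_category = unicodedata.category(u'\x93')

# The same byte means different things depending on the decoder.
as_latin1 = b'\x93'.decode('latin1')        # U+0093, a control character
as_cp1252 = b'\x93'.decode('windows-1252')  # U+201C, the left smart quote

# UTF-8 encodes U+0093 as two bytes; that is the \xc2\x93 seen in the file.
utf8_of_control = u'\x93'.encode('utf8')
# The real smart quote encodes to three different bytes.
utf8_of_quote = u'\u201c'.encode('utf8')

print(c1_category)
print(repr(as_latin1), repr(as_cp1252))
print(repr(utf8_of_control), repr(utf8_of_quote))
```

In other words, the file was written faithfully; the string already contained the wrong characters before it was encoded.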

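So use the correct code points, u'\u201c' and u'\u201d', or write the characters literally in a UTF-8 source file:

# coding: utf8
foo = u'“Stuff in smartquotes”'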
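With the right code points in the string, writing the file is straightforward. The sketch below is one possible way to do it, assuming io.open (available in Python 2.6+ and identical to the built-in open in Python 3) is acceptable; the temporary path is a placeholder, not something from the question.

```python
import io
import os
import tempfile

foo = u'\u201cStuff in smartquotes\u201d\n'

# Hypothetical scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'file.txt')

# io.open encodes on write, so no manual .encode('utf8') call is needed.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(foo)

# Read the raw bytes back to see exactly what landed on disk.
with io.open(path, 'rb') as f:
    raw = f.read()

print(repr(raw))
```

The bytes on disk now start with \xe2\x80\x9c rather than \xc2\x93, which is the sequence a UTF-8 aware editor or browser renders as a left smart quote.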

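Edit: If your Unicode string contains the incorrect characters, here is a way to fix it. The first 256 Unicode code points map 1:1 to the latin1 encoding, so latin1 can be used to encode a mis-decoded Unicode string directly back to a byte string, which can then be decoded with the correct encoding: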

>>> foo = u'\x93Stuff in smartquotes\x94'
>>> foo
u'\x93Stuff in smartquotes\x94'
>>> foo = foo.encode('latin1').decode('windows-1252')
>>> foo
u'\u201cStuff in smartquotes\u201d'
>>> print foo
“Stuff in smartquotes”

If you have the UTF-8 encoded version of the incorrect Unicode characters:

>>> foo = '\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo = foo.decode('utf8').encode('latin1').decode('windows-1252')
>>> foo
u'\u201cStuff in smartquotes\u201d'
>>> print foo
“Stuff in smartquotes”

And in the worst case, if you have the following Unicode string:

>>> foo = u'\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo.encode('latin1') # back to a UTF-8 encoded byte string.
'\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo.encode('latin1').decode('utf8') # Undo the UTF-8, but Unicode is still wrong.
u'\x93Stuff in smartquotes\x94'
>>> foo.encode('latin1').decode('utf8').encode('latin1') # back to a byte string.
'\x93Stuff in smartquotes\x94'
>>> foo.encode('latin1').decode('utf8').encode('latin1').decode('windows-1252') # Now decode correctly.
u'\u201cStuff in smartquotes\u201d'
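The round trips above assume every character in the string fits in latin1 and every byte is assigned in Windows-1252. For messier input, a per-character repair may be more forgiving. The helper below is a hypothetical sketch, not part of any library: it assumes that any character in the range U+0080 to U+009F came from a Windows-1252 byte that was decoded as latin1, and it leaves everything else alone.

```python
def repair_c1_controls(text):
    """Replace C1 control characters with their Windows-1252 meaning.

    Hypothetical helper: assumes any character in U+0080..U+009F is a
    Windows-1252 byte that was mistakenly decoded as latin1.
    """
    repaired = []
    for ch in text:
        if u'\x80' <= ch <= u'\x9f':
            try:
                ch = ch.encode('latin1').decode('windows-1252')
            except UnicodeDecodeError:
                # 0x81, 0x8d, 0x8f, 0x90 and 0x9d are unassigned in
                # Python's cp1252 codec, so keep the character unchanged.
                pass
        repaired.append(ch)
    return u''.join(repaired)


# The snowman (U+2603) would make a whole-string encode('latin1') fail.
fixed = repair_c1_controls(u'\x93Stuff in smartquotes\x94 \u2603')
print(repr(fixed))
```

Working character by character avoids a UnicodeEncodeError when the string also contains characters above U+00FF, at the cost of being slower than a single encode/decode pass.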


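Did you tell your text editor or web browser that it should read the file as UTF-8?

I manually set both the text editor and the web browser to read the file as UTF-8.

I'm not trying to write correct Unicode per se; I'm actually parsing some existing HTML that was probably converted from an MS Word document or something. I think you're right, but if I write file.write(foo.encode('latin1')), then the text renders correctly in a Latin or Western encoding. It appears to be Latin encoded. What is the right way to handle this sort of thing?

@WildGunman, it is not the latin1 encoding, it is windows-1252. See my edit above. You may be encoding the incorrect Unicode string in latin1, but if you see smart quotes, you are viewing the file in Windows-1252. Are you viewing the file on Windows? It will display a file in Windows-1252, even one encoded in latin1, and will show \x93 and \x94 as smart quotes. Also, there is a term for what you have.

Holy heckballs. I feel like people's frustration with Unicode is less about UTF-8 or ASCII being bad and more about the in-between encodings causing never-ending pain with both.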
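For the HTML-parsing case raised in the comments, one reasonable approach is to decode the bytes once, at the point where they are read, using the encoding they were actually written in. The sketch below assumes the Word-exported HTML is Windows-1252; the sample bytes are made up for illustration, and a real file may declare a different charset in its headers or meta tag.

```python
# Stand-in for bytes read from a Word-exported HTML file (assumed cp1252).
raw_html = b'<p>\x93Stuff in smartquotes\x94</p>'

# Decode once, with the encoding the bytes were actually written in.
text = raw_html.decode('windows-1252')

# From here on work with Unicode, and encode only when writing out.
utf8_bytes = text.encode('utf-8')

print(repr(text))
print(repr(utf8_bytes))
```

Decoding at the boundary means the string never contains the C1 control characters in the first place, so no repair step is needed later.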