Python's handling of shell strings

string unicode encoding utf-8


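I still don't completely understand how Python's unicode and str types work. Note: I am working in Python 2; as far as I know, Python 3 takes a completely different approach to the same issue.

What I know:

str is an older beast that stores strings encoded in one of the far too many encodings that history has forced us to work with.

unicode is a more standardised way of representing strings, using a huge table of all possible characters, emojis, little pictures of dog poop and so on.

The decode function transforms a str into unicode; encode does the opposite.

If I, in Python's shell, simply say: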

>>> my_string = "some string"

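then my_string is a str variable encoded in ascii (and, because ascii is a subset of utf-8, it is also encoded in utf-8).

Therefore, for example, I can convert it into a unicode variable with either of the following statements: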

>>> my_string.decode('ascii')
u'some string'  
>>> my_string.decode('utf-8')
u'some string'
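
The round trip described above can be sketched as follows. This is only a minimal illustration, written so that it also runs on Python 3, where the byte-string type is spelled bytes and the text type is str; on Python 2 the same calls apply to str and unicode. The variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
# A byte string holding ASCII-only data (a plain str on Python 2).
my_bytes = b"some string"

# decode: bytes -> text. ASCII is a subset of UTF-8, so both codecs agree here.
as_ascii = my_bytes.decode("ascii")
as_utf8 = my_bytes.decode("utf-8")

# encode: text -> bytes, the opposite direction.
round_trip = as_utf8.encode("utf-8")
```

Both decodes give the same text only because every byte is below 0x80; once a byte outside that range appears, the choice of codec starts to matter.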

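What I don't know:

How does Python handle non-ASCII strings that are entered in the shell and, knowing this, what is the correct way of saving the word "kožušček"?

For example, I can say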

>>> s1 = 'kožušček'

in which case s1 becomes a str instance that I am unable to convert into unicode:

>>> s1='kožušček'
>>> s1
'ko\x9eu\x9a\xe8ek'
>>> print s1
kožušček
>>> s1.decode('ascii')

Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    s1.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9e in position 2: ordinal not in range(128)

But when I entered the same word as a Unicode literal, s2 = u'kožušček', and printed s2, I got

>>> print s2
kouèek

which means that Python lost a whole letter. Can someone explain this to me?

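str objects contain bytes. What those bytes represent, Python doesn't dictate. If you produced ASCII-compatible bytes, you can decode them as ASCII. If they contain bytes representing UTF-8 data, they can be decoded as such. If they contain bytes representing an image, then you can decode that information and display an image somewhere. When you use repr() on a str object, Python leaves any ASCII-printable bytes as they are and converts the rest to escape sequences; this keeps debugging such data practical even in ASCII-only environments.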
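
As a rough illustration of "bytes have no meaning until you pick one", the sketch below reads the same four bytes in three different ways. The byte values and variable names are chosen only for the example, and it uses a bytes literal so that it also runs on Python 3.

```python
# The same four bytes, read three different ways.
data = b"\xc5\xbe\xc4\x8d"

as_utf8 = data.decode("utf-8")      # two characters: U+017E and U+010D
as_latin1 = data.decode("latin-1")  # four characters, one per byte
as_numbers = list(bytearray(data))  # just the raw byte values

# repr() keeps printable ASCII as-is and escapes everything else.
escaped = repr(b"ko\x9eu")
```

None of the three readings is more "correct" than the others as far as Python is concerned; only knowledge of where the bytes came from can settle it.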

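The terminal or console in which you run the interactive interpreter writes bytes to the stdin stream that Python reads from when you type. Those bytes are encoded according to the configuration of that terminal or console.

In your case, the console most likely encoded the input you typed using a Windows code page. You'll need to figure out the exact code page and use that codec to decode the bytes. Code page 1252 seems to fit: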

>>> print 'ko\x9eu\x9a\xe8ek'.decode('cp1252')
kožušèek
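
To see how much the choice of code page matters, the same bytes can be decoded with more than one candidate codec and the results compared. A small sketch using the bytes from the question; the shortlist of candidates is an assumption made for illustration, not a statement about any particular console.

```python
# -*- coding: utf-8 -*-
raw = b"ko\x9eu\x9a\xe8ek"  # bytes taken from the repr() output in the question

# Hypothetical shortlist of Windows code pages to try.
candidates = ("cp1252", "cp1250")

# Map each codec name to the text it produces for the same bytes.
decoded = {codec: raw.decode(codec) for codec in candidates}

# Show code points instead of glyphs so the output does not depend on the console.
code_points = {
    codec: [hex(ord(ch)) for ch in text if ord(ch) > 127]
    for codec, text in decoded.items()
}
```

The two codecs agree on 0x9e and 0x9a but map 0xe8 to different characters (è under cp1252, č under cp1250), which is why the exact code page has to be identified rather than guessed.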

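When you print those same bytes, your console reads them and interprets them with the same codec it is already configured for.

Python can tell you which codec it thinks your console is set to; it tries to detect this for Unicode literals, where the input has to be decoded for you. It uses locale.getpreferredencoding() to determine this, and the sys.stdin and sys.stdout objects have an encoding attribute; mine is set to UTF-8: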

>>> import sys
>>> sys.stdin.encoding
'UTF-8'
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
>>> 'kožušèek'
'ko\xc5\xbeu\xc5\xa1\xc3\xa8ek'
>>> u'kožušèek'
u'ko\u017eu\u0161\xe8ek'
>>> print u'kožušèek'
kožušèek

Because my terminal is configured for UTF-8 and Python has detected this, using a Unicode literal u'…' works. The data is decoded automatically by Python.
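
The checks above can be gathered into one small helper so they are easy to run on another machine. This is a sketch only: console_encodings is a hypothetical name, and the values it returns depend entirely on the environment it runs in.

```python
import locale
import sys


def console_encodings():
    """Collect the encodings Python reports for this process."""
    return {
        # These can be None when the stream is not attached to a terminal.
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        # The encoding suggested by the current locale settings.
        "preferred": locale.getpreferredencoding(),
    }


info = console_encodings()
```

If the stdin and stdout entries disagree, or either is None, that is a hint that input and output may not be going through the same codec.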

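Exactly why your console lost a whole letter I don't know; I'd need access to your console to run some more experiments, look at the output of print repr(s2), and test all bytes between 0x00 and 0xFF to see whether this happens on the input side or the output side of the console.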
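
One part of that experiment can be done without the console at all: checking which single byte values a given codec refuses to decode. The sketch below covers only that narrower check; it says nothing about what the console does on input or output. undecodable_bytes is a hypothetical helper name.

```python
def undecodable_bytes(codec):
    """Return the byte values (0x00-0xFF) the codec cannot decode on their own."""
    bad = []
    for value in range(256):
        try:
            # Build a one-byte string in a way that works on Python 2 and 3.
            bytes(bytearray([value])).decode(codec)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


gaps_cp1252 = undecodable_bytes("cp1252")
gaps_latin1 = undecodable_bytes("latin-1")
```

A codec with gaps in its table is one place where a character could be dropped or replaced, so comparing these lists against the bytes in repr(s2) may narrow down where the loss happens.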

I recommend you read up on Python and Unicode:

Ned Batchelder
Joel Spolsky

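Your system does not necessarily use the sys.getdefaultencoding() encoding; that is merely the default used when you convert without specifying an encoding, as in: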

>>> sys.getdefaultencoding()
'ascii'
>>> unicode(s1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 2: ordinal not in range(128)
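
The same failure can be reproduced with an explicit codec and inspected programmatically. A minimal sketch, assuming the bytes are the UTF-8 form of the word (consistent with byte 0xc5 at position 2 in the traceback); it uses a bytes literal so that it also runs on Python 3, which has no implicit conversion.

```python
# -*- coding: utf-8 -*-
s1 = b"ko\xc5\xbeu\xc5\xa1\xc4\x8dek"  # assumed: UTF-8 bytes for the word

try:
    s1.decode("ascii")
    failure = None
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte the codec could not handle.
    failure = (exc.encoding, exc.start, s1[exc.start:exc.start + 1])

# Naming the matching codec explicitly succeeds.
u1 = s1.decode("utf-8")
```

The exception carries the codec name and the offending position, which is usually enough to tell whether the wrong codec was used or the data itself is damaged.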

Using the encoding reported by the locale module, we can decode the string:

>>> u1=s1.decode(locale.getdefaultlocale()[1])
>>> u1
u'ko\u017eu\u0161\u010dek'
>>> print u1
kožušček

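There is a chance the locale has not been set up, as is the case with the 'C' locale. That may cause the reported encoding to be None even though the default is 'ascii'. Normally, figuring this out is the job of setlocale, which getpreferredencoding calls automatically. I suggest calling it once at program startup and saving the returned value for later use. The encoding used for file names may be yet another case; it is reported by sys.getfilesystemencoding().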
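
The "call it once at startup and keep the value" advice might look like the sketch below. detect_encoding, TEXT_ENCODING and FILENAME_ENCODING are hypothetical names, and falling back to 'ascii' is an assumption; a different fallback may suit a given program better.

```python
import locale
import sys


def detect_encoding(fallback="ascii"):
    """Pick an encoding once at startup; fall back if the locale reports none."""
    try:
        encoding = locale.getpreferredencoding()
    except locale.Error:
        # The locale settings could not be applied; treat as "unknown".
        encoding = None
    return encoding or fallback


# Computed once and reused, rather than queried at every decode call.
TEXT_ENCODING = detect_encoding()
# File names may use a different encoding than console text.
FILENAME_ENCODING = sys.getfilesystemencoding() or TEXT_ENCODING
```

Keeping the two values separate avoids assuming that file names and console text share one encoding.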

Python's internal default encoding is set up by the site module, which contains:

def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !

So if you want it set by default on every run of Python, you can change the first if 0 to if 1.

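You mean the interactive interpreter. It reads from the stdin stream; it is your console or terminal that does the encoding here.

Could you please specify whether you are talking about Python 2 or Python 3?

@MadMike: this is clearly Python 2.

@MartijnPieters: although that is clear to the experts among the readers, it should still be mentioned in the question.

It's cp1250, but thanks! How does the second answer fit with this, though? Why does u'kožušček' produce such a mess?

Thanks for the links, I read most of them (especially the no-excuses one); the problem I had was only with the shell->string part. It's clearer now. Thanks.

If this is a Windows Command Prompt, be aware that this console has huge issues with Unicode, at least in how Python interacts with it and in the default font choices Microsoft made.

@5xum: are you using IDLE? There is a bug. Something similar may exist in other parts of Python 2, i.e. the bug may also occur in the part that reads Unicode literals. What do you see if you run:

print u'ko\u017eu\u0161\xe8ek'

(Note: there are no non-ASCII characters in the literal.) Note: cp1250 is most likely not your console encoding (Windows uses di…)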
