Python, Windows console and encodings (cp850 vs cp1252)


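I thought I knew everything about encodings and Python, but today I ran into a strange problem: although the console is set to code page 850, and Python reports it correctly, the arguments I put on the command line seem to be encoded in code page 1252. If I try to decode them with sys.stdin.encoding, I get the wrong result. If I assume 'cp1252', ignoring what sys.stdout.encoding reports, it works.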

Am I missing something, or is this a bug in Python? In Windows? Note: I am running Python 2.6.6 on Windows 7 EN, with the locale set to French (Switzerland).

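In the test program below, I check that literals are interpreted correctly and can be printed; this works. But all the values I pass on the command line seem to be encoded wrongly: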

#!/usr/bin/python
# -*- encoding: utf-8 -*-
import sys

literal_mb = 'utf-8 literal:   üèéÃÂç€ÈÚ'
literal_u = u'unicode literal: üèéÃÂç€ÈÚ'
print "Testing literals"
print literal_mb.decode('utf-8').encode(sys.stdout.encoding,'replace')
print literal_u.encode(sys.stdout.encoding,'replace')

print "Testing arguments ( stdin/out encodings:",sys.stdin.encoding,"/",sys.stdout.encoding,")"
for i in range(1,len(sys.argv)):
    arg = sys.argv[i]
    print "arg",i,":",arg
    for ch in arg:
        print "  ",ch,"->",ord(ch),
        if ord(ch)>=128 and sys.stdin.encoding == 'cp850':
            print "<-",ch.decode('cp1252').encode(sys.stdout.encoding,'replace'),"[assuming input was actually cp1252 ]"
        else:
            print ""

I get the following output:

Testing literals
utf-8 literal:   üèéÃÂç?ÈÚ
unicode literal: üèéÃÂç?ÈÚ
Testing arguments ( stdin/out encodings: cp850 / cp850 )
arg 1 : abcÚÇ
   a -> 97
   b -> 98
   c -> 99
   Ú -> 233 <- é [assuming input was actually cp1252 ]
   Ç -> 128 <- ? [assuming input was actually cp1252 ]

Any idea what I am missing?

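Edit 1: I just tested by reading from sys.stdin. This works as expected: in cp850, typing 'é' results in an ordinal value of 130. So the problem really is specific to the command line. Is the command line handled differently from standard input?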
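The mismatch can be reproduced without a Windows console at all. The following is a minimal sketch in Python 3 (the question uses Python 2.6); the helper name `reinterpret` is made up for illustration. It encodes a character with one code page and decodes the resulting byte with the other, which is the mix-up the question describes.

```python
def reinterpret(text, written_as, read_as):
    """Encode text with one code page, then decode the bytes with another."""
    return text.encode(written_as).decode(read_as, "replace")


def byte_values(text, encoding):
    """Return the ordinal values of the bytes that `text` encodes to."""
    return list(text.encode(encoding))


# Byte values under each code page for the characters used in the question.
CP1252_BYTES = byte_values("é€", "cp1252")   # what the command line delivered
CP850_E_ACUTE = byte_values("é", "cp850")    # what typing into stdin delivered

# The same bytes, misread through the console code page.
MISREAD = reinterpret("é€", "cp1252", "cp850")
```

Here `CP1252_BYTES` is `[233, 128]` and `MISREAD` is `"ÚÇ"`, which matches the `abcÚÇ` shown in the output above, while `CP850_E_ACUTE` is `[130]`, the value observed when reading from sys.stdin.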

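Edit 2: It seems I had the wrong keywords. I found another, very close topic on SO. Still, if the command line is not encoded the same way as sys.stdin, and since sys.getdefaultencoding() reports 'ascii', there seems to be no way of knowing its actual encoding. I find the answer that uses the win32 extensions rather hacky.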

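Replying to myself:

On Windows, the encoding used by the console (and thus that of sys.stdin/sys.stdout) differs from the encoding of the various strings provided by the OS, obtained through e.g. os.getenv(), sys.argv, and certainly many more.

The encoding returned by sys.getdefaultencoding() is really just that, a default, chosen by the Python developers to match the "most reasonable encoding" the interpreter uses in extreme cases. I get 'ascii' on Python 2.6, and tried Portable Python 3.1, which yields 'utf-8'. Neither is what we are looking for; they are merely fallbacks for the encoding conversion functions.

As stated, the encoding used by OS-provided strings is governed by the Active Code Page (ACP). Since Python has no native function to retrieve it, I had to use ctypes: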

from ctypes import cdll
os_encoding = 'cp' + str(cdll.kernel32.GetACP())

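EDIT: As Jacek suggests, there is actually a more robust and more Pythonic way, locale.getpreferredencoding() (this still needs validation, but until proven wrong I will use it).

Then: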

u_argv = [x.decode(os_encoding) for x in sys.argv]
u_env = os.getenv('myvar').decode(os_encoding)

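On my system, os_encoding = 'cp1252', so it works. I am quite sure this would break on other platforms, so feel free to edit and make it more generic. We would certainly need some kind of translation table between the ACP reported by Windows and the Python encoding name, something better than just prepending 'cp'.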
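One possible shape for such a translation table is sketched below in Python 3. The function name `codec_for_code_page` and the contents of `SPECIAL_CODE_PAGES` are assumptions made for illustration, not something taken from the answer; the table only lists a few Windows code page identifiers whose Python codec name is not simply 'cp' plus the number. Any other number is tried as 'cpNNN' and checked with codecs.lookup(), with a fallback when Python has no such codec.

```python
import codecs

# Hypothetical, deliberately small table: Windows code page identifiers
# whose Python codec name does not follow the "cp<number>" pattern.
SPECIAL_CODE_PAGES = {
    65001: "utf-8",
    20127: "ascii",
    28591: "iso-8859-1",
}


def codec_for_code_page(code_page, fallback="utf-8"):
    """Map a Windows code page number to a codec name Python can use.

    Returns `fallback` when no matching codec is registered.
    """
    name = SPECIAL_CODE_PAGES.get(code_page, "cp%d" % code_page)
    try:
        codecs.lookup(name)  # raises LookupError for unknown codecs
    except LookupError:
        return fallback
    return name
```

On Windows this could be fed with the value returned by GetACP() from the ctypes snippet above, for example `codec_for_code_page(cdll.kernel32.GetACP())`.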

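This is unfortunately a hack, although I find it a bit less intrusive than the one suggested by the SO question linked from Edit 2 of my question. The advantage I see here is that it can be applied to os.getenv() as well, and not only to sys.argv.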
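Since the same decoding step applies to both sys.argv and os.getenv(), it can be wrapped in one small helper. This is only a sketch, written for Python 3, and the name `to_text` is hypothetical. In Python 3, sys.argv and os.getenv() already return text, so the helper mostly matters for raw bytes (or when porting the Python 2 code above). It also tolerates None, which os.getenv() returns for an unset variable and which would make a bare `.decode()` call fail.

```python
import locale


def to_text(value, encoding=None, errors="replace"):
    """Decode an OS-provided byte string; pass text and None through as-is.

    `encoding` defaults to locale.getpreferredencoding(False).
    """
    if value is None:
        return None
    if isinstance(value, bytes):
        return value.decode(encoding or locale.getpreferredencoding(False), errors)
    return value


# Bytes as the question's command line delivered them (cp1252).
decoded = to_text(b"abc\xe9\x80", "cp1252")
```

With the encoding from the answer, `decoded` is the text `abcé€`, and `to_text(os.getenv('myvar'))` no longer raises when `myvar` is not set.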

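I tried these solutions. There may still be some encoding problems, because the console also needs a TrueType font. The fix:

Run chcp 65001 in cmd to change the encoding to UTF-8.

Change the cmd font to a TrueType one, such as Lucida Console, that supports the code pages preceding 65001.

Here is my complete fix for the encoding error: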

def fixCodePage():
    import os      # needed for os.system() below
    import sys
    import ctypes
    if sys.platform == 'win32':
        if sys.stdout.encoding != 'cp65001':
            os.system("echo off")
            os.system("chcp 65001") # Change the active code page to UTF-8
            sys.stdout.write("\x1b[A") # ANSI "cursor up", meant to hide the output of chcp
            sys.stdout.flush()
        LF_FACESIZE = 32
        STD_OUTPUT_HANDLE = -11

        class COORD(ctypes.Structure):
            _fields_ = [("X", ctypes.c_short), ("Y", ctypes.c_short)]

        class CONSOLE_FONT_INFOEX(ctypes.Structure):
            _fields_ = [("cbSize", ctypes.c_ulong),
                        ("nFont", ctypes.c_ulong),
                        ("dwFontSize", COORD),
                        ("FontFamily", ctypes.c_uint),
                        ("FontWeight", ctypes.c_uint),
                        ("FaceName", ctypes.c_wchar * LF_FACESIZE)]

        # Switch the console to a TrueType font (Lucida Console, 7x12)
        font = CONSOLE_FONT_INFOEX()
        font.cbSize = ctypes.sizeof(CONSOLE_FONT_INFOEX)
        font.nFont = 12
        font.dwFontSize.X = 7
        font.dwFontSize.Y = 12
        font.FontFamily = 54
        font.FontWeight = 400
        font.FaceName = "Lucida Console"
        handle = ctypes.windll.kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
        ctypes.windll.kernel32.SetCurrentConsoleFontEx(handle, ctypes.c_long(False), ctypes.pointer(font))

Note: you can see the font change while the program is running.

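On Linux, it is usually locale.getpreferredencoding() or, after calling locale.setlocale(), locale.getlocale()[1] that gives the correct encoding for console and environment access. That said, hard-coding UTF-8 is good enough for most modern systems (which makes it the best fallback value).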
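The lookup order described in this comment could be sketched as follows, in Python 3. The function name `guess_os_encoding` is hypothetical; the sketch tries locale.getpreferredencoding(), then locale.getlocale()[1], and finally the hard-coded UTF-8 fallback, skipping any candidate that Python has no codec for.

```python
import codecs
import locale


def guess_os_encoding(fallback="utf-8"):
    """Return a usable codec name for OS-provided strings.

    Tries the locale-derived encodings first, then `fallback`.
    """
    candidates = [locale.getpreferredencoding(False)]
    try:
        candidates.append(locale.getlocale()[1])
    except ValueError:
        # getlocale() can fail to parse an unusual locale setting.
        pass
    candidates.append(fallback)

    for name in candidates:
        if not name:
            continue
        try:
            return codecs.lookup(name).name  # normalised codec name
        except LookupError:
            continue
    return fallback
```

The returned name is the normalised one from codecs.lookup(), so for example a locale reporting 'UTF-8' yields 'utf-8'.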

You can use os.system('chcp 65001>nul') to suppress the output of chcp.
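A different way to avoid the chcp output altogether, swapped in here as an alternative to os.system() and not what the answer above does, is to call the Win32 functions SetConsoleCP and SetConsoleOutputCP through ctypes. The sketch below is Python 3, the function name `set_console_utf8` is made up for illustration, and it does nothing on non-Windows platforms. Note that it only changes the console code pages; the encoding of an already created sys.stdout object stays whatever it was.

```python
import sys

UTF8_CODE_PAGE = 65001


def set_console_utf8():
    """Switch the console input and output code pages to UTF-8.

    Returns True if both calls succeeded, False otherwise
    (including on platforms other than Windows).
    """
    if sys.platform != "win32":
        return False
    import ctypes  # imported late: ctypes.windll exists only on Windows
    kernel32 = ctypes.windll.kernel32
    ok_in = kernel32.SetConsoleCP(UTF8_CODE_PAGE)
    ok_out = kernel32.SetConsoleOutputCP(UTF8_CODE_PAGE)
    return bool(ok_in and ok_out)
```

Because nothing is spawned, there is no chcp output to hide and no need for the cursor-up escape sequence.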

The locale-based variant suggested by Jacek, as mentioned in the EDIT of the self-answer above:

import locale
os_encoding = locale.getpreferredencoding()
# This returns 'cp1252' on my system, yay!
