Read Unicode characters from command-line arguments in Python 2.x on Windows

python windows command-line unicode


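I want my Python script to be able to read Unicode command-line arguments on Windows. But sys.argv appears to be a byte string encoded in some local encoding, rather than Unicode. How can I read the command line in full Unicode?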

Sample code:

argv.py

import sys

first_arg = sys.argv[1]
print first_arg
print type(first_arg)
print first_arg.encode("hex")
print open(first_arg)

On my PC, which is set up with a Japanese code page, I get:

C:\temp>argv.py "PC・ソフト申請書08.09.24.doc"
PC・ソフト申請書08.09.24.doc
<type 'str'>
50438145835c83748367905c90bf8f9130382e30392e32342e646f63
<open file 'PC・ソフト申請書08.09.24.doc', mode 'r' at 0x00917D90>


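I believe that is Shift-JIS encoded, and it "works" for that file name. But it breaks for file names containing characters that are not in the Shift-JIS character set; the final "open" call fails: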

C:\temp>argv.py Jörgen.txt
Jorgen.txt
<type 'str'>
4a6f7267656e2e747874
Traceback (most recent call last):
  File "C:\temp\argv.py", line 7, in <module>
    print open(first_arg)
IOError: [Errno 2] No such file or directory: 'Jorgen.txt'
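
Both outputs above can be reproduced on any machine. The following is a minimal sketch, assuming that Python's cp932 codec is a fair stand-in for the Japanese Windows code page; the constant names are only labels for this illustration. It checks that the first file name encodes to exactly the hex dump shown, and that "ö" has no cp932 encoding at all. In the output above Windows substituted a plain "o"; Python's "replace" error handler substitutes "?" instead. Either way the original character is gone before the script sees it.

```python
import binascii

# File names from the question, written with escapes so that the
# encoding of this source file cannot interfere.
JAPANESE_NAME = u"PC\u30fb\u30bd\u30d5\u30c8\u7533\u8acb\u66f808.09.24.doc"
GERMAN_NAME = u"J\u00f6rgen.txt"


def to_hex(raw):
    """Return the lowercase hex dump of a byte string as text."""
    return binascii.hexlify(raw).decode("ascii")


# Every character of the Japanese name exists in cp932.
japanese_hex = to_hex(JAPANESE_NAME.encode("cp932"))

# U+00F6 (o with diaeresis) does not exist in cp932.
try:
    GERMAN_NAME.encode("cp932")
    german_fits = True
except UnicodeEncodeError:
    german_fits = False

# A lossy conversion has to substitute something for the character.
german_lossy = GERMAN_NAME.encode("cp932", "replace")

print(japanese_hex)
print(german_fits)
print(german_lossy)
```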


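Note: I'm talking about Python 2.x, not Python 3.0. I've found that Python 3.0 gives sys.argv as proper Unicode. But it's a bit early yet to transition to Python 3.0 (because of the lack of third-party library support).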

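Update:

A few answers have said I should decode sys.argv according to whatever encoding it is in. The problem with that is that it is not full Unicode, so some characters are not representable.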

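Here is the use case that gives me grief: I have [...]. I have file names with all sorts of characters, including some that are not in the system default code page. Whenever a character is not representable in the current code page encoding, my Python script does not get the right Unicode file name via sys.argv.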

Certainly there is some Windows API for reading the command line in full Unicode (and Python 3.0 does that). I assume the Python 2.x interpreter is not using it.
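
The point of the update can be shown as a round trip. This is a minimal sketch, not what Windows does internally: it assumes the lossy step behaves like Python's "replace" error handler, and round_trip is a hypothetical helper name. Once a character has been replaced on the way into the code page, decoding with that same code page cannot bring it back.

```python
def round_trip(name, code_page):
    """Push a unicode name through a code page and decode it again.

    Characters missing from the code page are replaced with "?",
    which stands in for the substitution done before sys.argv is built.
    """
    return name.encode(code_page, "replace").decode(code_page)


japanese = u"PC\u30fb\u30bd\u30d5\u30c8\u7533\u8acb\u66f808.09.24.doc"
german = u"J\u00f6rgen.txt"

# Survives: every character is in cp932.
print(round_trip(japanese, "cp932") == japanese)

# Does not survive: the decoded result is a different file name.
print(repr(round_trip(german, "cp932")))
```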

The command line might be in a Windows encoding. Try decoding the arguments into unicode objects:

args = [unicode(x, "iso-8859-9") for x in sys.argv]
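
Decoding only helps when the codec matches the bytes that Windows actually produced. As a rough illustration, the sketch below takes the hex dump printed by argv.py in the question and decodes it twice: once with cp932 (assumed here to match the asker's Japanese code page) and once with iso-8859-9. Because iso-8859-9 is a single-byte codec, it never raises an error on these bytes; it just produces a different, wrong string.

```python
import binascii

# Bytes of sys.argv[1] as printed by argv.py in the question.
RAW = binascii.unhexlify(
    "50438145835c83748367905c90bf8f9130382e30392e32342e646f63")

# Matching codec: the double-byte sequences become Japanese characters.
as_cp932 = RAW.decode("cp932")

# Mismatched single-byte codec: every byte becomes one character.
as_latin5 = RAW.decode("iso-8859-9")

print(repr(as_cp932))
print(repr(as_latin5))
print(len(RAW), len(as_cp932), len(as_latin5))
```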

Try this:

import sys
print repr(sys.argv[1].decode('UTF-8'))

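You may have to substitute CP437 or CP1252 for UTF-8. You should be able to infer the proper encoding name from the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\OEMCP.

Here is a solution that is just what I was looking for. It calls the Windows GetCommandLineW and CommandLineToArgvW functions (from ActiveState):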

But I've made several changes to simplify its usage and to handle certain uses better. Here is what I use:

win32_unicode_argv.py

"""
win32_unicode_argv.py

Importing this will replace sys.argv with a full Unicode form.
Windows only.

From this site, with adaptations:
      http://code.activestate.com/recipes/572200/

Usage: simply import this module into a script. sys.argv is changed to
be a list of Unicode strings.
"""


import sys

def win32_unicode_argv():
    """Uses kernel32.GetCommandLineW and shell32.CommandLineToArgvW to get
    sys.argv as a list of Unicode strings.

    Versions 2.x of Python don't support Unicode in sys.argv on
    Windows, with the underlying Windows API instead replacing multi-byte
    characters with '?'.
    """

    from ctypes import POINTER, byref, c_int, windll
    from ctypes.wintypes import LPCWSTR, LPWSTR

    # kernel32 exports use the stdcall convention, so load them via windll.
    GetCommandLineW = windll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = LPCWSTR

    CommandLineToArgvW = windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
    CommandLineToArgvW.restype = POINTER(LPWSTR)

    cmd = GetCommandLineW()
    argc = c_int(0)
    argv = CommandLineToArgvW(cmd, byref(argc))
    if argc.value > 0:
        # Remove Python executable and commands if present
        start = argc.value - len(sys.argv)
        return [argv[i] for i in
                xrange(start, argc.value)]

sys.argv = win32_unicode_argv()

Now, the way I use it is simply:

import sys
import win32_unicode_argv

From then on, sys.argv is a list of Unicode strings. The Python optparse module seems happy to parse it, which is great.
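
The slicing step in win32_unicode_argv is worth spelling out. The full Windows command line also contains the interpreter and any interpreter options, while sys.argv starts at the script, so the function keeps only the last len(sys.argv) items. The sketch below shows that arithmetic on plain lists, with no Windows API involved; trim_interpreter_args and the example paths are hypothetical.

```python
def trim_interpreter_args(full_argv, script_argv_len):
    """Keep the trailing items of full_argv that correspond to sys.argv.

    full_argv        -- every argument of the process, interpreter included
    script_argv_len  -- len(sys.argv) as seen by the running script
    """
    start = len(full_argv) - script_argv_len
    return full_argv[start:]


# What CommandLineToArgvW might return for "python.exe -u argv.py <name>".
full = [u"C:\\Python27\\python.exe", u"-u", u"argv.py", u"J\u00f6rgen.txt"]

# The script itself would see a sys.argv of length 2.
print(trim_interpreter_args(full, 2))
```

When the script is started directly through its file association, as in "argv.py Jörgen.txt", the full command line may differ in how many leading items it has, which is why the count is taken from the end rather than from the start.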

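Dealing with encodings is very confusing.

I believe that if you input data via the command line, it will be encoded in whatever your system encoding is, not as unicode. (Even copy/paste should do this.)

So it should be correct to decode it into unicode using the system encoding: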

import codecs
import sys

first_arg = sys.argv[1]
print first_arg
print type(first_arg)

first_arg_unicode = first_arg.decode(sys.getfilesystemencoding())
print first_arg_unicode
print type(first_arg_unicode)

f = codecs.open(first_arg_unicode, 'r', 'utf-8')
unicode_text = f.read()
print type(unicode_text)
print unicode_text.encode(sys.getfilesystemencoding())

Run it like this:

Prompt> python myargv.py "PC・ソフト申請書08.09.24.txt"

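Also, if you are dealing with encoded files, you may want to use the codecs.open() function in place of the built-in open(). It lets you specify the encoding of the file and then uses that encoding to transparently decode the content to unicode.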

unicode_str = utf8_str.decode('utf8')
unicode_str = unicode(utf8_str, 'utf8')

So when you call content = codecs.open("myfile.txt", "r", "utf8").read(), content will be a unicode object. See the documentation for codecs.open.
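
As a small illustration of that behaviour, the sketch below writes a UTF-8 file and reads it back with codecs.open. The sample text and the temporary file are assumptions made for the example only. The bytes on disk are UTF-8, while the value returned by read() is already decoded.

```python
import codecs
import os
import tempfile

# Sample text mixing Japanese and Latin-1 characters (illustrative only).
text = u"PC\u30fb\u30bd\u30d5\u30c8 J\u00f6rgen"

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # codecs.open encodes on write...
    with codecs.open(path, "w", "utf-8") as f:
        f.write(text)
    # ...and decodes on read, so content is unicode, not bytes.
    with codecs.open(path, "r", "utf-8") as f:
        content = f.read()
    # The built-in open in binary mode shows the raw UTF-8 bytes.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

print(type(content))
print(len(raw), len(content))
```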

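If I have misunderstood, please let me know.

If you haven't already read Joel's article on unicode and encodings, I recommend reading it.

Yes, that would work. Just remove the ".encode('utf-8')" at the end.

This code does not work for me when I drag and drop a file onto the py file. However, it does work when I type the file name at the command prompt. I wrote a C++ program that calls GetCommandLineW, and if I drag and drop a file onto that program, it displays the file name correctly.

That is necessary. I haven't done this for a while (and at a different company), but I think I must have enabled long file names.

@CraigMcQueen I did not enable it, but my Python program is still able to accept drag-and-dropped files. My program just takes a file name as its argument and then displays the file name in hex. I found that some characters had become 0x3f ("?").

I have integrated this code into the development version of my win-unicode-console package.

-1: "iso-8859-9" is not a Windows encoding. You have just made the problem worse.

Check this question here on Stack Overflow; it should provide the answer to your problem.

Yes, it seems to be an exact dup.

That question and its answers discuss raw_input(). I'm interested in the command line, i.e. sys.argv.

Actually, you can loop over sys.argv like this: for arg in sys.argv: print arg.decode("utf-8"). I used print, but you can do whatever you need to do. You also need to choose the right encoding.

Does Japanese Windows use raster fonts in the console by default? That might restrict it to displaying characters in the Windows-932 code page. (This is a different issue from reading the arguments, but it may have some bearing.)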