Python zipfile module: zipfile.write() with file names containing Turkish characters


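I have many Word documents on my system and I want to compress them using the Python module zipfile.

I found a solution to my problem, but on my system some file names contain German umlauts and Turkish characters.

I adapted the method as follows, so that it can handle German umlauts in file names: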

def zipdir(path, ziph):
    for root, dirs, files in os.walk(path):
        for file in files:
            current_file = os.path.join(root, file)
            print "Adding to archive -> file: "+str(current_file)
            try:
                #ziph.write(current_file.decode("cp1250")) #German umlauts ok, Turkish chars not ok
                ziph.write(current_file.encode("utf-8")) #both not ok
                #ziph.write(current_file.decode("utf-8")) #both not ok
            except Exception,ex:
                print "exception ---> "+str(ex)
                print repr(current_file)
                raise
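Unfortunately, my attempts to add logic for the Turkish characters have not been successful. Every time a file name contains a Turkish character, the code prints an exception, for example: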

exception ---> [Error 123] Die Syntax f³r den Dateinamen, Verzeichnisnamen oder
die Datentrõgerbezeichnung ist falsch: u'X:\\my\\path\\SomeTurk?shChar?shere.doc'
I have tried several ways of encoding and decoding the string, but none of them worked.

Can someone help me?


I edited the code above to include the changes mentioned in the comments.

It now shows the following error:

...
Adding to archive -> file: X:\\my\path\blabla I blabla.doc
Adding to archive -> file: X:\my\path\bla bla³bla³bla³bla.doc
exception ---> 'ascii' codec can't decode byte 0xfc in position 24: ordinal not
in range(128)
'X:\\my\\path\\bla B\xfcbla\xfcbla\xfcbla.doc'
Traceback (most recent call last):
  File "Backup.py", line 48, in <module>
    zipdir('X:\\my\\path', zipf)
  File "Backup.py", line 12, in zipdir
    ziph.write(current_file.encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 24: ordinal
 not in range(128)
The ³ in the console output is actually a German ü.


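EDIT: After trying the suggestions from the comments, I was unable to find a solution.

So I switched to the Groovy programming language and used its Zip capabilities.


Since this has turned into an opinion-based discussion, I decided to vote to close the thread.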

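If you do not need to inspect the ZIP file with any archiver later on, you can always encode the file names as base64 and restore them when you extract the archive with Python.

To any archiver these file names will look like gibberish, but the original names will be preserved.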
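The base64 idea above could look roughly like the sketch below. It is written for Python 3 (an assumption, the question uses Python 2), and the helper names encode_name and decode_name are made up for illustration. The URL-safe base64 alphabet is used so that the stored name never contains a "/", which ZIP treats as a path separator.

```python
import base64
import io
import zipfile


def encode_name(name):
    """Return an ASCII-only archive name for an arbitrary Unicode file name."""
    return base64.urlsafe_b64encode(name.encode("utf-8")).decode("ascii")


def decode_name(stored):
    """Reverse encode_name() and return the original Unicode file name."""
    return base64.urlsafe_b64decode(stored.encode("ascii")).decode("utf-8")


original = "İstanbul çığlığı.doc"  # hypothetical file name

# Write one member under its base64 name into an in-memory archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(encode_name(original), b"dummy content")

# Read the archive back and restore the readable name.
with zipfile.ZipFile(buffer) as archive:
    stored = archive.namelist()[0]
    restored = decode_name(stored)
```

The stored name is pure ASCII, so no archiver can mangle it, at the cost of being unreadable without the decoding step.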

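Anyway, to get a byte string (str in Python 2, a bytes object in Py3), you must encode(), not decode().

encode() serializes a unicode() string into bytes:

>>> u"\u0161blah".encode("utf-8")
'\xc5\xa1blah'
decode() goes the other way and returns a unicode() object from that value:

>>> '\xc5\xa1blah'.decode("utf-8")
u'\u0161blah'
The same goes for any other code page.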
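For readers on Python 3, the same round trip can be sketched as follows. The only assumption is the interpreter version: in Python 3 the text type is str and the encoded type is bytes, so the two directions cannot be mixed up silently.

```python
text = "\u0161blah"            # text (str), "šblah"
data = text.encode("utf-8")    # text -> bytes
back = data.decode("utf-8")    # bytes -> text

# The same pair of calls works with a legacy code page such as cp1252,
# where "ü" is the single byte 0xfc seen in the traceback of the question.
umlaut_bytes = "ü".encode("cp1252")
umlaut_text = umlaut_bytes.decode("cp1252")
```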

Sorry for stressing this, but people sometimes get confused about what encoding and decoding do.

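If you need the files but do not care much about preserving umlauts and other symbols, you can use:

u"üsdlakui".encode("utf-8", "replace")
or:

u"üsdlakui".encode("utf-8", "ignore")
This will replace unknown characters with a substitute character, or ignore any decode/encode errors entirely.

If the error being raised is something like UnicodeDecodeError: cannot decode character, this will take care of it.

The problem, however, will be file names that contain only non-Latin characters.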
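A small Python 3 sketch of the two error handlers follows. It uses the ascii codec rather than utf-8 for encoding, because UTF-8 can represent every character and the handler would never be triggered; that swap is my choice for the demonstration, not something from the answer. It also shows the drawback just mentioned: a name made up only of non-Latin characters loses everything.

```python
name = "üsdlakui"

# "replace" substitutes a question mark, "ignore" drops the character.
replaced = name.encode("ascii", "replace")
ignored = name.encode("ascii", "ignore")

# The handlers work for decoding too: 0xfc is not valid UTF-8,
# so it becomes the Unicode replacement character U+FFFD.
decoded = b"\xfcsdlakui".decode("utf-8", "replace")

# A name with no ASCII characters at all ends up empty with "ignore".
emptied = "ığşç".encode("ascii", "ignore")
```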

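Now, something that might actually work:

Well,

'Sömethüng'.encode("utf-8")

is bound to raise an ASCII codec error, because no unicode string is defined: non-Latin characters that should otherwise be described as unicode/UTF-8 characters are used in a plain byte string, which Python 2 implicitly decodes as ASCII before encoding, and the file itself is not UTF-8 encoded.

While:

# -*- coding: UTF-8 -*-
u'Sömethüng'.encode("utf-8")

with the encoding declared at the top of the file, and the file saved as UTF-8, should work.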

Yes, your strings come from the OS (file names), but that is the problem to begin with.
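The error from the question can be reproduced in isolation. The sketch below is Python 3, where nothing is decoded implicitly, so the ASCII decode that Python 2 performs behind the scenes when encode() is called on a byte string is spelled out explicitly. The cp1252 code page is an assumption based on the German Windows system in the question.

```python
# Byte string as the OS handed it over, with 0xfc standing for "ü".
raw = b"X:\\my\\path\\bla B\xfcbla.doc"

# What Python 2 does implicitly before encoding a byte string:
try:
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)

# Decoding with the code page the bytes were really written in works.
name = raw.decode("cp1252")
```

Decoding with the right code page gives a proper text string, which can then be encoded to UTF-8 without trouble.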

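Even with the encoding done correctly, the ZIP side of the problem remains.

According to the specification, ZIP should store file names in CP437, but that rarely happens.

Most archivers use the default OS encoding (MBCS in Python).

And most archivers do not support UTF-8. So what I propose here should work, but not with every archiver.

To tell a ZIP archiver that the archive uses UTF-8 file names, bit 11 of the general purpose flag bits (mask 0x800) should be set. As I said, some archivers do not check it. It is a recent addition to the ZIP specification (well, a few years old by now).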
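To see the flag in practice, here is a sketch for Python 3, whose zipfile module sets this bit on its own when a member name is not pure ASCII. The member names are invented for the example, and the constant UTF8_FLAG is just a local name for the mask.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag

# Write one ASCII name and one non-ASCII name into an in-memory archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("plain.doc", b"a")
    archive.writestr("Türkçe ığşçö.doc", b"b")

# Read the archive back and check which members announce UTF-8 names.
with zipfile.ZipFile(buffer) as archive:
    flags = {
        info.filename: bool(info.flag_bits & UTF8_FLAG)
        for info in archive.infolist()
    }
```

Only the non-ASCII name carries the flag. Whether another archiver honours it when listing or extracting is still up to that archiver.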

I will not write the complete code here, only the part that is needed to understand the idea.

# -*- coding: utf-8 -*-
# Cannot hurt to have default encoding set to UTF-8 all the time. :D

import os, time, zipfile
zf = zipfile.ZipFile(...)
# Careful here: origname is the full path to the file you will store in the ZIP,
# filename is the name under which the file will be stored in the ZIP.
# It is probably better if filename is relative rather than a full path,
# so as not to introduce problems when extracting. You decide.
filename = origname = os.path.join(root, filename)
# File names from the OS can already be UTF-8, but they can also be in a local code page.
# I use MBCS here to decode, so that we can encode to UTF-8 later.
# I recommend getting the code page from the OS (from kernel32.dll on Windows)
# manually instead of using MBCS, but for now:
if isinstance(filename, str):  # Python 2 byte string
    filename = filename.decode("mbcs")
# Else, assume it is already a decoded unicode string.
# Prepare the file name for the archive:
filename = os.path.normpath(os.path.splitdrive(filename)[1])
while filename[0] in (os.sep, os.altsep):
    filename = filename[1:]
filename = filename.replace(os.sep, "/")
filename = filename.encode("utf-8")  # Get what we need
zinfo = zipfile.ZipInfo(filename, time.localtime(os.path.getmtime(origname))[0:6])
# Here you should set zinfo.external_attr to store Unix permission bits and set zinfo.compress_type.
# Both are optional and not related to your problem, just a note.
zinfo.flag_bits |= 0x800  # Set bit 11, announcing UTF-8 file names.
with open(origname, "rb") as f:
    zf.writestr(zinfo, f.read())
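I have not tested it, I just wrote the code down, but this is the idea, even if a bug or two shows up somewhere.


If this does not work, I do not know what will.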
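For comparison, here is a sketch of the asker's zipdir in Python 3, where os.walk returns text strings and zipfile encodes non-ASCII member names as UTF-8 by itself. This is not the code from the answer but a rewrite under the assumption that switching interpreters is an option. The temporary directory and the file name exist only to make the example self-contained, and the file system is assumed to accept Unicode file names.

```python
import os
import tempfile
import zipfile


def zipdir(path, archive):
    """Add every file below path to archive, using names relative to path."""
    for root, _dirs, files in os.walk(path):
        for name in files:
            full_path = os.path.join(root, name)
            arcname = os.path.relpath(full_path, path)
            archive.write(full_path, arcname)


with tempfile.TemporaryDirectory() as workdir:
    # Create one document with German and Turkish characters in its name.
    source = os.path.join(workdir, "docs")
    os.mkdir(source)
    with open(os.path.join(source, "Müşteri çağrısı.doc"), "wb") as handle:
        handle.write(b"dummy")

    # Zip the directory, then list what was stored.
    target = os.path.join(workdir, "backup.zip")
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        zipdir(source, archive)
    with zipfile.ZipFile(target) as archive:
        names = archive.namelist()
```

Storing names relative to the zipped directory follows the advice in the code comments above about avoiding full paths inside the archive.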

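When I run ziph.write(current_file.encode("utf-8")), both German umlauts and Turkish characters cause an exception, the same as with ziph.write(current_file.decode("utf-8")).

Please see my edit, and give us the output of this change to your code: try: ziph.write(current_file.encode("utf-8")); except: print repr(current_file); raise; Mind the indentation and newlines, of course :D

Sorry I'm late. I did not see your edit because nothing showed up in my inbox. Now I think I have found a solution for you. I hope it works. Good luck!

I will try it as soon as possible.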