Python zipfile module: zipfile.write() with files whose names contain Turkish characters
On my system there are many Word documents, and I want to compress them with the Python module zipfile.
I found a solution to my problem, but on my system some files have German umlauts and Turkish characters in their filenames.
I adapted the approach like this so that it can handle German umlauts in filenames:
def zipdir(path, ziph):
    for root, dirs, files in os.walk(path):
        for file in files:
            current_file = os.path.join(root, file)
            print "Adding to archive -> file: "+str(current_file)
            try:
                #ziph.write(current_file.decode("cp1250")) #German umlauts ok, Turkish chars not ok
                ziph.write(current_file.encode("utf-8")) #both not ok
                #ziph.write(current_file.decode("utf-8")) #both not ok
            except Exception,ex:
                print "exception ---> "+str(ex)
                print repr(current_file)
                raise
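Unfortunately, my attempts to add logic for Turkish characters have not succeeded so far. Every time a filename contains Turkish characters, the code prints an exception, for example: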
exception ---> [Error 123] Die Syntax f³r den Dateinamen, Verzeichnisnamen oder
die Datentrõgerbezeichnung ist falsch: u'X:\\my\\path\\SomeTurk?shChar?shere.doc'
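I have tried several ways of encoding and decoding the string, but none of them worked.
Can anyone help me?
I edited the code above to include the change mentioned in the comments. It now shows the following error: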
...
Adding to archive -> file: X:\\my\path\blabla I blabla.doc
Adding to archive -> file: X:\my\path\bla bla³bla³bla³bla.doc
exception ---> 'ascii' codec can't decode byte 0xfc in position 24: ordinal not
in range(128)
'X:\\my\\path\\bla B\xfcbla\xfcbla\xfcbla.doc'
Traceback (most recent call last):
File "Backup.py", line 48, in <module>
zipdir('X:\\my\\path', zipf)
File "Backup.py", line 12, in zipdir
ziph.write(current_file.encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 24: ordinal
not in range(128)
The ³ in the output is actually a German ü.
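Edit: After trying the suggestions from the comments, I could not find a solution. So I switched to the Groovy programming language and used its Zip functionality.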
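Since this is an opinion-based discussion, I decided to vote to close the thread.
If you do not need to inspect the ZIP file with any archiver later, you can always encode the filenames as base64 and restore them when extracting with Python. To any archiver those filenames will look like gibberish, but the encoding will be preserved.
Anyway, to get a byte string (a str in Py2, a bytes object in Py3), you have to encode(), not decode(). encode() serializes a unicode() string into bytes: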
>>> u"\u0161blah".encode("utf-8")
'\xc5\xa1blah'
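decode() turns such a value back into a unicode() string.
The same holds for any other code page.
Sorry for stressing this, but people sometimes get confused about what encoding and decoding do.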
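The round trip described above can be sketched as follows. This is a minimal example that assumes Python 3, where text is str and encoded data is bytes; in the Python 2 used elsewhere in this thread the corresponding types are unicode and str. The variable names are made up for illustration.

```python
# Python 3 sketch: encode() goes from text to bytes, decode() goes back.
text = "\u0161blah"               # the text from the example above
data = text.encode("utf-8")       # b'\xc5\xa1blah'
restored = data.decode("utf-8")   # back to the original text

# The same pattern holds for other code pages, e.g. Windows-1250.
data_cp1250 = text.encode("cp1250")
restored_cp1250 = data_cp1250.decode("cp1250")

print(repr(data), repr(restored), repr(data_cp1250))
```

Decoding only gives the original text back if the codec matches the one used for encoding, which is exactly what goes wrong when the code page of a filename is guessed incorrectly.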
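If you need the files but do not care much about preserving umlauts and other symbols, you can use:
u"üsdlakui".encode("utf-8", "replace")
or the same call with the "ignore" error handler.
This replaces unknown characters with a substitute, or silently ignores any decoding/encoding errors.
If the error being raised is something like UnicodeDecodeError: cannot decode character, this takes care of it.
The trouble, however, is with filenames that consist only of non-Latin characters.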
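To make the effect of the error handlers visible, here is a small sketch, again assuming Python 3. It encodes to ASCII rather than UTF-8, because the handlers only come into play when the target codec cannot represent a character. It also shows the drawback mentioned above: information in the name is lost.

```python
# Python 3 sketch: error handlers matter when the target codec lacks a character.
name = "üsdlakui"

replaced = name.encode("ascii", "replace")  # unencodable characters become b"?"
ignored = name.encode("ascii", "ignore")    # unencodable characters are dropped

try:
    name.encode("ascii")                    # the default handler is "strict"
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(replaced, ignored, strict_failed)
```

A name made up entirely of non-ASCII characters would turn into a run of question marks or an empty string this way, so this is only a fallback.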
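Now, something that might actually work:
A call like
'Sömethüng'.encode("utf-8")
is bound to raise an "ASCII codec" error, because the literal is not declared as a unicode string. It is a plain byte string holding non-Latin characters that should have been described as unicode/UTF-8 characters, and the file itself is not declared as UTF-8 encoded.
Whereas:
# -*- coding: UTF-8 -*-
u'Sömethüng'.encode("utf-8")
or
# -*- coding: UTF-8 -*-
unicode('Sömethüng', "utf-8").encode("utf-8")
should work, provided the encoding is declared at the top of the file and the file is saved as UTF-8.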
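This also explains why the traceback in the question reports a decode error even though the code calls encode(): in Python 2, calling encode() on a byte string first decodes it implicitly with the ASCII codec. The sketch below reproduces that hidden step in Python 3. The bytes are modeled on the repr output in the question, and cp1252 as their code page is an assumption.

```python
# Python 3 sketch of the implicit ASCII decode that Python 2 performs.
raw = b"bla B\xfcbla.doc"     # byte string with a code-page encoded u-umlaut

try:
    raw.decode("ascii")       # what Python 2 attempts before encoding a str
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

text = raw.decode("cp1252")   # assumption: the bytes are in cp1252
utf8 = text.encode("utf-8")   # encoding works once we hold real text
print(ascii_ok, text, utf8)
```

The fix is therefore to decode with the right code page first and only then encode to UTF-8.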
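Yes, you do get the strings from the OS (the filenames), but that is the problem from the very start.
Even with the encoding done right, the ZIP issue remains.
By specification, ZIP should store filenames in CP437, but that rarely happens.
Most archivers use the default OS encoding (MBCS in Python).
And most archivers do not support UTF-8. So what I propose here should work, but not with every archiver.
To tell a ZIP archiver that the archive uses UTF-8 filenames, bit 11 of the flag bits (0x800) has to be set. As I said, some archivers do not check it. It is a recent addition to the ZIP specification (well, a few years old, really).
I won't write the complete code here, only the part needed to understand the idea.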
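As a way to check that flag, the sketch below writes an in-memory archive and reads the flag bits back. It assumes Python 3, whose zipfile module sets the UTF-8 flag itself when a name cannot be stored as ASCII. The file name is a made-up example.

```python
import io
import zipfile

name = "Türkçe_ışık.doc"   # hypothetical file name with Turkish characters

# Write a one-entry archive into memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(name, b"dummy content")

# Read it back and inspect the general-purpose flag bits of the entry.
with zipfile.ZipFile(buf) as zf:
    info = zf.infolist()[0]
    utf8_flag_set = bool(info.flag_bits & 0x800)
    print(info.filename, utf8_flag_set)
```

An archiver that ignores this flag will still show a garbled name, which is the limitation described above.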
# -*- coding: utf-8 -*-
# Cannot hurt to have default encoding set to UTF-8 all the time. :D
import os, time, zipfile
zip = zipfile.ZipFile(...)
# Careful here, origname is the full path to the file you will store into ZIP
# filename is the filename under which the file will be stored in the ZIP
# It'll probably be better if filename is not a full path, but relative, not to introduce problems when extracting. You decide.
filename = origname = os.path.join(root, filename)
# Filenames from OS can be already UTF-8, but they can be a local codepage.
# I will use MBCS here to decode from it, so that we can encode to UTF-8 later.
# I recommend getting codepage from OS (from kernel32.dll on Windows) manually instead of using MBCS, but for now:
if isinstance(filename, str):
    filename = filename.decode("mbcs")
# Else, assume it is already a decoded unicode string.
# Prepare the filename for archive:
filename = os.path.normpath(os.path.splitdrive(filename)[1])
while filename[0] in (os.sep, os.altsep):
    filename = filename[1:]
filename = filename.replace(os.sep, "/")
filename = filename.encode("utf-8") # Get what we need
# Note: the modification time comes from os.path.getmtime(), there is no os.getmtime()
zinfo = zipfile.ZipInfo(filename, time.localtime(os.path.getmtime(origname))[0:6])
# Here you should set zinfo.external_attr to store Unix permission bits and set the zinfo.compression_type
# Both are optional and not a subject to your problem. But just as notice.
zinfo.flag_bits |= 0x800 # Set bit 11 (0x800) to announce the UTF-8 filenames.
with open(origname, "rb") as f:
    zip.writestr(zinfo, f.read())
I have not tested it, I just wrote the code down, but this is the idea, even if a bug crept in somewhere.
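For comparison, here is a hedged sketch of the same idea in Python 3, where os.walk() already returns text filenames and zipfile marks non-ASCII names as UTF-8 on its own. The function name zip_directory and its parameters are hypothetical, not taken from the thread. It stores paths relative to the source directory, as the comments above recommend.

```python
import os
import zipfile


def zip_directory(src_dir, zip_path):
    """Zip every file under src_dir, storing paths relative to src_dir."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Archive names are relative and use forward slashes.
                arcname = os.path.relpath(full_path, src_dir).replace(os.sep, "/")
                zf.write(full_path, arcname)
```

A call could look like zip_directory("X:\\my\\path", "backup.zip"). The archive itself should be written outside the source directory so that it does not get added to itself.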
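If this doesn't work, I don't know what will.
When I execute ziph.write(current_file.encode("utf-8")), both German umlauts and Turkish characters cause an exception, just as with ziph.write(current_file.decode("utf-8")).
See my edit, and give us the output after making this change to your code: try: ziph.write(current_file.encode("utf-8")); except: print repr(current_file); raise; of course, mind the indentation and newlines :D
Sorry I'm late. I did not see your edit because nothing showed up in my inbox. I think I have now found a solution for you. Hope it works. Good luck!
I'll try it as soon as possible.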