Python gzip CRC check failed

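I have a folder containing a large number of text files. Each one is gzipped and weighs several gigabytes. I wrote a piece of code to split the content of each gzip file: each file is opened with gzip, then every specified chunk of lines is read and written to a new gzip file.

Here is the code, in the file file_compression.py: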

import sys, os, file_manipulation as fm
import gzip


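def splitGzipFile(fileName, dest=None, chunkPerSplit=100, linePerChunk=4, file_field_separator="_", zfill=3,
                  verbose=False, file_permission=None, execute=True):
    """
    Splits a gz file into chunk files.
    :param fileName: path of the gz file to split
    :param dest: destination folder (defaults to the folder of fileName)
    :param chunkPerSplit: number of chunks written to each output file
    :param linePerChunk: number of lines in one chunk
    :return: None
    """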
    absPath = os.path.abspath(fileName)
    baseName = os.path.basename(absPath)
    dirName = os.path.dirname(absPath)
    destFolder = dirName if dest is None else dest


    ## Compute file fields
    rawBaseName, extensions = baseName.split(os.extsep, 1)

    if not str(extensions).startswith("."):
        extensions = "." + extensions

    file_fields = str(rawBaseName).split(file_field_separator)
    first_fields = file_fields[:-1] if len(file_fields) > 1 else file_fields
    first_file_part = file_field_separator.join(first_fields)
    last_file_field = file_fields[-1] if len(file_fields) > 1 else ""
    current_chunk = getCurrentChunkNumber(last_file_field)
    if current_chunk is None or current_chunk < 0:
        first_file_part = rawBaseName

    ## Initialize chunk variables
    linePerSplit = chunkPerSplit * linePerChunk
    # chunkCounter = 0

    chunkCounter = 0 if current_chunk is None else current_chunk-1

    for chunk in getFileChunks(fileName, linePerSplit):
        print "writing " + str(len(chunk)) + " ..."
        chunkCounter += 1
        oFile = fm.buildPath(destFolder) + first_file_part + file_field_separator + str(chunkCounter).zfill(zfill) + extensions

        if execute:
            writeGzipFile(oFile, chunk, file_permission)
        if verbose:
            print "Splitting: created file ", oFile



def getCurrentChunkNumber(chunk_field):
    """
    Tries to guess an integer from a string.
    :param chunk_field:
    :return: an integer, None if failure.
    """
    try:
        return int(chunk_field)
    except ValueError:
        return None


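def getFileChunks(fileName, linePerSplit):
    """
    Yields the content of a gz file as strings of at most linePerSplit lines.
    """
    with gzip.open(fileName, 'rb') as f:
        print "gzip open"
        lineCounter = 0
        currentChunk = ""
        for line in f:
            currentChunk += line
            lineCounter += 1
            if lineCounter >= linePerSplit:
                # The chunk is full: hand it over and start a new one
                yield currentChunk
                currentChunk = ""
                lineCounter = 0
        # Yield the last, possibly shorter, chunk
        if currentChunk:
            yield currentChunk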


def writeGzipFile(file_name, content, file_permission=None):
    # gzip is already imported at module level
    with gzip.open(file_name, 'wb') as f:
        if content:
            f.write(content)

    # Apply the requested permission bits, e.g. 0o644
    if isinstance(file_permission, int):
        os.chmod(file_name, file_permission)

So far, the code has always run fine. But today I ran into a CRC failure on a gzip file:

Process Process-3:72:
Traceback (most recent call last):
  ...
  File "/.../tools/file_utils/file_compression.py", line 43, in splitGzipFile
    for chunk in getFileChunks(fileName, linePerSplit):
  File "/.../tools/file_utils/file_compression.py", line 70, in getFileChunks
    for line in f:
  File "/.../python2.7/lib/python2.7/gzip.py", line 450, in readline
    c = self.read(readsize)
  File "/.../python2.7/lib/python2.7/gzip.py", line 256, in read
    self._read(readsize)
  File "/.../python2.7/lib/python2.7/gzip.py", line 320, in _read
    self._read_eof()
  File "/.../python2.7/lib/python2.7/gzip.py", line 342, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0xddbb6045 != 0x34fd5580L

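What could be the root cause of this problem? I must state again that so far it has always worked, and the folders and files always have the same structure. The difference in this case may be that my script is processing more gzip files than usual, perhaps twice as many.

Could it be a problem of the same file being accessed at the same time? I seriously doubt it, though: I make sure that is not the case by registering every accessed file in my split_Seed list.

I would welcome any hint, as I have no more clues about where to look.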
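
As background on the error message above: in the gzip format (RFC 1952), each member ends with an 8-byte trailer holding the CRC-32 of the uncompressed data and its length modulo 2^32, both little-endian. The reader recomputes the CRC while decompressing and raises the error when the two values differ. Below is a minimal sketch, assuming a small single-member stream built in memory. The helper name read_trailer is hypothetical, and the code targets Python 3 rather than the Python 2.7 of the traceback.

```python
import gzip
import struct
import zlib


def read_trailer(raw_bytes):
    """Return (stored_crc, stored_size) from the last 8 bytes of a
    single-member gzip stream (hypothetical helper for illustration)."""
    return struct.unpack("<II", raw_bytes[-8:])


payload = b"line 1\nline 2\nline 3\nline 4\n"
raw = gzip.compress(payload)  # in-memory gzip stream with one member

stored_crc, stored_size = read_trailer(raw)
computed_crc = zlib.crc32(payload) & 0xFFFFFFFF  # what a reader recomputes

# On an intact stream both values agree; damage to the compressed data
# or to the trailer makes the comparison fail when the file is read.
print(hex(stored_crc), hex(computed_crc), stored_size == len(payload))
```

A mismatch therefore means the bytes on disk are not the bytes that were originally compressed. The check says nothing about when or how they changed.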

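Edit 1

Could it be that someone else, or another program, accessed some of the open files? I cannot ask for testimonies and rely on them. So, to start with, if I put a multiprocessing.Lock in place, would it prevent any other thread, process, program, user, etc. from modifying the file? Or is it limited to Python? I cannot find any documentation on that.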
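
On the locking question: a multiprocessing.Lock only coordinates code that explicitly acquires that same Lock object, which in practice means the cooperating processes of one Python program. It is not attached to any file, so it cannot stop another program or user from modifying one. Below is a minimal sketch, assuming Python 3. The function write_without_lock is a hypothetical stand-in for any code path that never looks at the lock.

```python
import multiprocessing
import os
import tempfile


def write_without_lock(path):
    # Hypothetical "other program": it knows nothing about our lock,
    # so nothing prevents it from appending to the file.
    with open(path, "ab") as f:
        f.write(b"modified\n")


def demo():
    lock = multiprocessing.Lock()
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with lock:                    # the lock is held here...
            write_without_lock(path)  # ...yet the write still succeeds
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


print(demo())
```

Keeping other programs out would need an OS-level mechanism instead, such as file locks (for example fcntl.flock on Unix, which is itself advisory) or working on a private copy of the file.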

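I got the exact same error on code that had been running for months. It turned out that the source of that particular file was corrupted. I went back to an old file and it worked fine, and I used a new file and it also worked fine.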

I had the same problem. I just deleted the old files and reran the code:

rm -rf /tmp/imagenet/

Have you checked, using gunzip, whether the file named fileName is corrupted?
