Python: reading, compressing, and writing with multiprocessing


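I'm compressing files. A single process is fine for a few of them, but I'm compressing thousands, and this can take (and has taken) days, so I'd like to speed it up with multiprocessing. I've read that I should avoid having multiple processes read files at the same time, and I'm guessing I shouldn't have multiple processes writing at once either. This is my current method, which runs as a single process: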

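import bz2, os, tarfile
from io import BytesIO

def preprocess(raw):
    "placeholder: the real pre-processing step is not shown in the question"
    return raw

def compress(folder):
    "compresses a folder into a file"
    with bz2.BZ2File(folder + '.tbz', 'w') as bz_file:
        with tarfile.open(mode='w', fileobj=bz_file) as tar:
            for fn in os.listdir(folder):
                # read each file in the folder and do some pre-processing
                # that makes the compressed file much smaller than without
                with open(os.path.join(folder, fn), 'rb') as src:
                    processed = preprocess(src.read())
                info = tarfile.TarInfo(fn)
                info.size = len(processed)  # addfile copies info.size bytes
                tar.addfile(info, BytesIO(processed))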

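This compresses everything in a folder into a single file, which makes the data easier to handle and better organized. If I just threw this into a Pool, I'd have several processes reading and writing at the same time, which I want to avoid. I can rework it so that only one process reads the files, but I still have multiple processes writing: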

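import multiprocessing as mp
import bz2, os, tarfile
from io import BytesIO

def compress(file_list):
    # file_list[0] is the folder path; the rest are (name, bytes) pairs
    folder = file_list[0]
    with bz2.BZ2File(folder + '.tbz', 'w') as bz_file:
        with tarfile.open(mode='w', fileobj=bz_file) as tar:
            for name, raw in file_list[1:]:
                # preprocess() stands for the pre-processing step,
                # which the question does not show
                processed = preprocess(raw)
                info = tarfile.TarInfo(name)
                info.size = len(processed)
                tar.addfile(info, BytesIO(processed))

cpu_count = mp.cpu_count()
p = mp.Pool(cpu_count)

# main_folder is the directory that holds the subfolders (not shown)
fld_list = []
for subfolder in os.listdir(main_folder):
    path = os.path.join(main_folder, subfolder)

    # read all files in subfolder into memory, place into file_list
    file_list = [path]
    for fn in os.listdir(path):
        with open(os.path.join(path, fn), 'rb') as src:
            file_list.append((fn, src.read()))

    # place file_list into fld_list until fld_list contains cpu_count
    # file lists, then pass the batch to p.map
    fld_list.append(file_list)
    if len(fld_list) == cpu_count:
        p.map(compress, fld_list)
        fld_list = []

if fld_list:  # leftover batch smaller than cpu_count
    p.map(compress, fld_list)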

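This still has many processes writing compressed files at the same time. Merely telling tarfile which compression to use starts writing data to the hard drive. I can't read all the files I need to compress into memory, because I don't have enough RAM for that, so having to restart Pool.map many times is a problem too.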

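How can I read and write files in a single process, do all of the compression in multiple processes, and avoid restarting multiprocessing.Pool over and over?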

Instead of using multiprocessing.Pool, use multiprocessing.Queue and create an inbox and an outbox.

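Start a single process that reads the files and puts the data into the inbox queue, and put a limit on the size of that queue so that you don't fill up RAM. The example here compresses single files, but it can be adapted to handle whole folders at a time.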

import os

def reader(inbox, input_path, num_procs):
    "process that reads in files to be compressed and puts to inbox"

    for fn in os.listdir(input_path):
        path = os.path.join(input_path, fn)

        # read in each file, put data into inbox
        fname = os.path.basename(fn)
        with open(path, 'r') as src:  # open the full path, not the bare name
            lines = src.readlines()

        data = [fname, lines]
        inbox.put(data)  # blocks while the inbox is full

    # read in everything, add finished notice for all running processes
    for i in range(num_procs):
        inbox.put(None)  # when a compressor sees a None, it will stop
    inbox.close()
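To see what the size limit on the inbox buys you, here is a minimal sketch of the back-pressure it creates. It is an illustration only: it uses threads and queue.Queue instead of processes to stay short (multiprocessing.Queue takes the same maxsize argument), and demo_backpressure is a made-up name, not part of the answer's code.

```python
import queue
import threading
import time

def demo_backpressure(maxsize=2, n_items=6):
    """Return the largest queue size seen and the items consumed, in order."""
    box = queue.Queue(maxsize)      # put() blocks once maxsize items are waiting
    seen = []

    def producer():
        for i in range(n_items):
            box.put(i)              # waits here while the queue is full
            seen.append(box.qsize())
        box.put(None)               # sentinel: nothing more to read

    t = threading.Thread(target=producer)
    t.start()

    consumed = []
    for item in iter(box.get, None):
        time.sleep(0.01)            # pretend compression is slow
        consumed.append(item)
    t.join()
    return max(seen), consumed

peak, consumed = demo_backpressure()
```

However slow the consumer is, the producer can never get more than maxsize items ahead of it, so memory use stays bounded by the queue size times the size of one item.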

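But that is only half of the problem; the other half is compressing the files without having to write them to disk. We give the compression function a StringIO-style in-memory buffer instead of an open file (io.BytesIO on Python 3, since tarfile writes bytes), and that buffer is passed to tarfile. After compression, we put the buffer object into the outbox queue.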

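But we can't do that, because the StringIO object can't be pickled, and only picklable objects can go through a queue. The buffer's getvalue function, however, returns the contents in a picklable form, so grab the contents with getvalue, close the buffer, and then put the contents into the outbox.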

from io import BytesIO
import tarfile

def compressHandler(inbox, outbox):
    "process that pulls from inbox, compresses and puts to outbox"
    # iter(inbox.get, None) yields items until it receives a None
    for data in iter(inbox.get, None):
        pressed = compress(data)  # compress it
        outbox.put(pressed)  # put into outbox
    outbox.put(None)  # finished compressing, inform the writer

def compress(data):
    "compress one file in memory, return (file name, archive bytes)"
    fname, lines = data  # see reader def for package order
    payload = ''.join(lines).encode('utf-8')  # tarfile needs bytes

    bz_file = BytesIO()  # in-memory stand-in for an open file
    with tarfile.open(mode='w:bz2', fileobj=bz_file) as tar:
        info = tarfile.TarInfo(fname)  # store file name
        info.size = len(payload)  # addfile copies exactly info.size bytes
        tar.addfile(info, BytesIO(payload))  # compress

    tardata = bz_file.getvalue()  # plain bytes, which can be pickled
    bz_file.close()
    return fname, tardata  # the writer unpacks (name, data)
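A quick way to convince yourself that the in-memory approach works is to build an archive in a buffer and read it straight back. The following is a small sketch under assumed names: pack, unpack, and the sample file name notes.txt are invented for the illustration and are not part of the answer's code.

```python
import tarfile
from io import BytesIO

def pack(fname, payload):
    """Build a bz2-compressed tar archive in memory and return it as bytes."""
    buf = BytesIO()
    with tarfile.open(mode='w:bz2', fileobj=buf) as tar:
        info = tarfile.TarInfo(fname)
        info.size = len(payload)          # addfile reads exactly info.size bytes
        tar.addfile(info, BytesIO(payload))
    return buf.getvalue()                 # plain bytes: fine to put on a queue

def unpack(archive):
    """Read the archive back from bytes and return {member name: contents}."""
    out = {}
    with tarfile.open(mode='r:bz2', fileobj=BytesIO(archive)) as tar:
        for member in tar.getmembers():
            out[member.name] = tar.extractfile(member).read()
    return out

archive = pack('notes.txt', b'hello\n' * 1000)
restored = unpack(archive)
```

Note the info.size line: a TarInfo created by hand has size 0, and addfile copies only that many bytes, so leaving it out produces an archive whose members are empty.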

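Then the writer process pulls contents from the outbox queue and writes them to disk. This function needs to know how many compression processes were started, so that it stops only after it has heard from every one of them that it has finished.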

import os

def writer(outbox, output_path, num_procs):
    "single process that writes compressed files to disk"
    num_fin = 0

    # keep going until all compression processes have finished
    while num_fin < num_procs:
        tardata = outbox.get()

        # a compression process has finished
        if tardata is None:
            num_fin += 1
            continue

        fn, data = tardata
        name = os.path.join(output_path, fn) + '.tbz'

        with open(name, 'wb') as dst:
            dst.write(data)

Finally, there is the setup that puts it all together:

import multiprocessing as mp
import os

def setup():
    fld = 'file/path'

    # multiprocess setup
    num_procs = mp.cpu_count()

    # inbox and outbox queues
    inbox = mp.Queue(4*num_procs)  # limit size so reading can't fill RAM
    outbox = mp.Queue()

    # one process to read; named reader_proc so it doesn't shadow reader()
    reader_proc = mp.Process(target=reader, args=(inbox, fld, num_procs))
    reader_proc.start()

    # n processes to compress
    compressors = [mp.Process(target=compressHandler, args=(inbox, outbox))
                   for i in range(num_procs)]
    for c in compressors: c.start()

    # one process to write; named writer_proc so it doesn't shadow writer()
    writer_proc = mp.Process(target=writer, args=(outbox, fld, num_procs))
    writer_proc.start()
    writer_proc.join()  # wait for it to finish
    print('done!')

if __name__ == '__main__':
    setup()

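You have to understand what pbzip2 does and emulate it. Use queues together with multiple processes or multiple threads. First, one process reads all the files and puts them into queue 1. Second, several processes take files from queue 1, compress them, and put the results into queue 2. Finally, one process takes from queue 2 and does the writing.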
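The three stages described above can be sketched as follows. This is only an illustration under stated assumptions: it uses threads and queue.Queue to stay self-contained, compresses with bz2.compress rather than pbzip2 itself, and stores results in a dict where a real writer would write to disk. The names run_pipeline, q1, q2, and sample are invented for the example. The same layout carries over to multiprocessing.Process and multiprocessing.Queue.

```python
import bz2
import queue
import threading

def run_pipeline(files, num_workers=3):
    """files: dict of name -> bytes. Returns dict of name -> compressed bytes."""
    q1 = queue.Queue(4 * num_workers)   # queue 1: raw data, bounded
    q2 = queue.Queue()                  # queue 2: compressed data
    written = {}

    def reader():
        for name, raw in files.items():
            q1.put((name, raw))
        for _ in range(num_workers):    # one stop marker per compressor
            q1.put(None)

    def compressor():
        for name, raw in iter(q1.get, None):
            q2.put((name, bz2.compress(raw)))
        q2.put(None)                    # tell the writer this worker is done

    def writer():
        finished = 0
        while finished < num_workers:   # stop after every worker has reported
            item = q2.get()
            if item is None:
                finished += 1
            else:
                name, data = item
                written[name] = data    # stand-in for writing to disk

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    threads += [threading.Thread(target=compressor) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return written

sample = {'f%d.txt' % i: (b'line %d\n' % i) * 500 for i in range(10)}
result = run_pipeline(sample)
```

Only the single reader touches the input and only the single writer touches the output, while the compressors in between see nothing but in-memory data.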