How do I track a Python multiprocessing pool and run a function after every X iterations?
I have some simple Python multiprocessing code that looks like this:
files = ['a.txt', 'b.txt', 'c.txt', etc..]

def convert_file(file):
    do_something(file)

mypool = Pool(number_of_workers)
mypool.map(convert_file, files)
I have 100,000 files to convert through convert_file, and I would like to run a function that uploads every 20 converted files to a server, without waiting for all of the files to be converted first. How would I go about doing that?

You can use a shared variable across your processes that keeps track of the converted files.
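A minimal sketch of that shared-counter idea, using `multiprocessing.Value` (the `init_worker` helper and the placeholder comment inside `convert_file` are illustrative, not part of the original code):

```python
from multiprocessing import Pool, Value

counter = None

def init_worker(shared_counter):
    # runs once in each worker process; keep a reference to the shared counter
    global counter
    counter = shared_counter

def convert_file(filename):
    # ... the real do_something(filename) conversion would go here ...
    # Value's lock serialises increments coming from different workers
    with counter.get_lock():
        counter.value += 1
    return filename

if __name__ == '__main__':
    files = ['file{}.txt'.format(i) for i in range(100)]
    shared = Value('i', 0)
    with Pool(4, initializer=init_worker, initargs=(shared,)) as pool:
        pool.map(convert_file, files)
    print(shared.value)  # every conversion was counted
```

The main loop can poll `shared.value` while the pool is running and trigger an upload each time it crosses a multiple of 20.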
The variable is automatically locked whenever a process wants to read or write it. While it is locked, all other processes that want to access the variable have to wait. This means you can poll the variable in your main loop and check whether it has reached 20, while the conversion processes keep incrementing it. Once the value exceeds 20, you reset it and upload the files to the server.

With multiprocessing you have a small problem with how to handle exceptions that occur in individual jobs. If you use the map variants, then you need to be careful about how you poll for results, because you can lose some results if the map function is forced to raise an exception. Furthermore, unless you have special handling for any exceptions within the job, you won't even know which job was the problem. If you use the apply variants, then you don't need to be as careful when getting your results, but collating the results becomes a little trickier.

On the whole, I think map is the easiest to get working.

First, you need a special exception. It cannot be created in your main module, or else Python will not be able to serialise and deserialise it correctly.

For example:

custom_exceptions.py
Are you worried about the possibility of do_something raising an exception? If so, then you will need to handle things much more carefully.

@Dunes Could you clarify that a little further? I'm not expecting exceptions, but they are entirely possible. I looked at the example you provided and tried some versions of it in my test example, but it's still not clear to me how to do this. I get the error: UnboundLocalError: local variable 'XXX' referenced before assignment

Thanks, I tried incorporating your code and ran it on 14 files with a chunksize of 6. I saw two full dispatches, but nothing after that... zombie processes?

Your convert_file function might be throwing an instance of BaseException, which could cause the pool to hang. Try catching BaseException instead of Exception in the job and see what happens.
class FailedJob(Exception):
    pass

main.py
from multiprocessing import Pool
import time
import random

from custom_exceptions import FailedJob


def convert_file(filename):
    # pseudo implementation to demonstrate what might happen
    if filename == 'file2.txt':
        time.sleep(0.5)
        raise Exception
    elif filename == 'file0.txt':
        time.sleep(0.3)
    else:
        time.sleep(random.random())
    return filename  # return filename, so we can identify the job that was completed


def job(filename):
    """Wraps any exception that occurs with FailedJob so we can identify which
    job failed and why"""
    try:
        return convert_file(filename)
    except Exception as ex:
        raise FailedJob(filename) from ex


def main():
    chunksize = 4  # number of jobs before dispatch
    total_jobs = 20
    files = list('file{}.txt'.format(i) for i in range(total_jobs))

    with Pool() as pool:
        # we use imap_unordered as we don't care about order, we want the result
        # of the jobs as soon as they are done
        iter_ = pool.imap_unordered(job, files)
        while True:
            completed = []
            while len(completed) < chunksize:
                # collect results from the iterator until we reach the dispatch
                # threshold, or until all jobs have been completed
                try:
                    result = next(iter_)
                except StopIteration:
                    print('all child jobs completed')
                    # only break out of the inner loop, there might still be some
                    # completed jobs to dispatch
                    break
                except FailedJob as ex:
                    print('processing of {} job failed'.format(ex.args[0]))
                else:
                    completed.append(result)

            if completed:
                print('completed:', completed)
                # put your dispatch logic here

            if len(completed) < chunksize:
                print('all jobs completed and all job completion notifications'
                      ' dispatched to central server')
                return


if __name__ == '__main__':
    main()
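If per-job exception handling is not a concern, the same every-N batching can be written more compactly by slicing the `imap_unordered` iterator with `itertools.islice`; a minimal sketch, where `upload` is a hypothetical stand-in for the real server upload and `convert_file` is a placeholder:

```python
from itertools import islice
from multiprocessing import Pool


def convert_file(filename):
    # placeholder conversion; returns the filename so batches are identifiable
    return filename


def upload(batch):
    # hypothetical stand-in for the real server upload
    print('uploading {} files'.format(len(batch)))


def run(files, batch_size=20):
    """Convert files in a pool and upload every `batch_size` completed results.

    Returns the number of batches uploaded."""
    batches = 0
    with Pool() as pool:
        results = pool.imap_unordered(convert_file, files)
        while True:
            # take up to batch_size completed results off the iterator
            batch = list(islice(results, batch_size))
            if not batch:
                break
            upload(batch)
            batches += 1
    return batches


if __name__ == '__main__':
    run(['file{}.txt'.format(i) for i in range(100)])
```

The final, shorter batch (and an empty input) falls out naturally: `islice` simply yields fewer items, and an empty slice ends the loop.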