Python unexpectedly raises StopIteration while multiprocessing files

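I am using multiprocessing to process 110 groups of 4 files, i.e. 440 files in total. Every file has the same number of lines (10.4 million), so the for loops are written on the assumption that no iterator finishes before the others.

The code parses these files into collections.Counter objects and aggregates the Counter dicts for further analysis.

The code raises StopIteration in a way I cannot reproduce reliably: with the same input data, I sometimes get StopIteration in one worker process and sometimes in two of the four. What am I doing wrong? I never get StopIteration with a small amount of test data.

My pseudocode: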

def main(*args):
    # code that sets up dicts to route Counter data

    def dict_populator_worker_process(*input_file_tuple_list):
        worker_dict = Counter()
        my_subproc_read_dict = defaultdict(list)

        for index_file1, index_file2, run_file1, run_file2 in input_file_tuple_list:
            index1_file_handle = open(index_file1, "rUb")
            index2_file_handle = open(index_file2, "rUb")
            run1_file_handle = open(run_file1, "rUb")
            run2_file_handle = open(run_file2, "rUb")
            for line in index1_file_handle:
                index2_file_handle.next()
                index_for_read = (index1_file_handle.next().strip(), index2_file_handle.next().strip())
                worker_dict.update((index_for_read,))
                for i in range(4):
                    try:
                        # THIS IS WHERE I SHOULD NOT BE exhausting the iterator
                        my_subproc_read_dict[(index_for_read, 1)].append(run1_file_handle.next())
                        my_subproc_read_dict[(index_for_read, 2)].append(run2_file_handle.next())
                    except StopIteration:
                        # sometimes get this undeservedly
                        pass
                # logger.info(index_for_read)
                index1_file_handle.next()  # Handles the +
                index2_file_handle.next()  # Handles the +
                index1_file_handle.next()  # Handles the Q
                index2_file_handle.next()  # Handles the Q

        # logger.info(worker_dict.keys())
        pid = multiprocessing.current_process()
        pickle.dump(worker_dict, open("counter_dict_{}.p".format(pid), "wb"))
        pickle.dump(my_subproc_read_dict, open("my_subproc_read_dict_{}.p".format(pid), "wb"))
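
One way to find out which file runs dry first is to stop swallowing StopIteration and read all four files strictly in lockstep. What follows is a minimal Python 3 sketch, not the code from the question: `read_in_lockstep` and `collect` are hypothetical helper names, and it assumes four-line records whose second line is the index sequence, as the comments in the pseudocode suggest. Unlike the original, it strips the trailing newline from every stored line.

```python
from collections import Counter, defaultdict
from itertools import zip_longest

_MISSING = object()  # sentinel for "this file had no more lines"


def read_in_lockstep(paths, lines_per_record=4):
    """Yield one tuple of records per step, one record (tuple of lines) per file.

    Raises ValueError naming the file(s) that ended before the others.
    """
    handles = [open(path) for path in paths]
    try:
        buffers = [[] for _ in paths]
        stream = zip_longest(*handles, fillvalue=_MISSING)
        for line_no, lines in enumerate(stream, start=1):
            short = [p for p, line in zip(paths, lines) if line is _MISSING]
            if short:
                raise ValueError(
                    "line %d: ended early: %s" % (line_no, ", ".join(short))
                )
            for buf, line in zip(buffers, lines):
                buf.append(line.rstrip("\n"))
            if line_no % lines_per_record == 0:
                yield tuple(tuple(buf) for buf in buffers)
                buffers = [[] for _ in paths]
        if buffers[0]:
            raise ValueError("files end with an incomplete record")
    finally:
        for handle in handles:
            handle.close()


def collect(index_file1, index_file2, run_file1, run_file2):
    """Count index pairs and group the run-file lines by index pair."""
    counts = Counter()
    reads = defaultdict(list)
    paths = [index_file1, index_file2, run_file1, run_file2]
    for idx1, idx2, run1, run2 in read_in_lockstep(paths):
        key = (idx1[1], idx2[1])  # assumed: line 2 of each record is the index
        counts[key] += 1
        reads[(key, 1)].extend(run1)
        reads[(key, 2)].extend(run2)
    return counts, reads
```

If one of the four files is shorter than the others, the error message reports the line number and the path, which is more useful than a silent `pass`.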

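I would like to know why I get StopIteration when the iterators advance in lockstep and all the files, which were produced with the Linux shell command split, have equal line counts, including equally sized tail files. To set up the files, I did the following: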

all_index_file_tuples = zip(glob.glob("file1*pat.txt"),glob.glob("file2*pat.txt"),glob.glob("file3*pat.txt"),glob.glob("file4*pat.txt"))


chunksize = int(math.ceil(len(all_index_file_tuples) / float(NUM_PROCS)))
procs = []

for i in range(NUM_PROCS):
    p = multiprocessing.Process(target=dict_populator_worker_process , args = (all_index_file_tuples[chunksize * i:chunksize * (i + 1)]))
    procs.append(p)
    p.start()
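
One thing worth ruling out in this setup: `glob.glob` returns matches in arbitrary order, so zipping four unsorted glob results can pair a chunk from one file set with a different chunk from another, for example a short tail file with a full-size one. The sketch below is a Python 3 illustration under the assumption that matching chunks share the same suffix, so that sorting lines them up. `paired_file_groups` and `chunk` are hypothetical helper names.

```python
import glob
import math
import os


def paired_file_groups(patterns, directory="."):
    """Return tuples of matching files, one file per pattern, in sorted order."""
    columns = [sorted(glob.glob(os.path.join(directory, p))) for p in patterns]
    if len({len(column) for column in columns}) != 1:
        raise ValueError(
            "patterns matched different numbers of files: %s"
            % [len(column) for column in columns]
        )
    return list(zip(*columns))


def chunk(items, num_chunks):
    """Split items into num_chunks consecutive slices (the last may be shorter)."""
    size = int(math.ceil(len(items) / float(num_chunks)))
    return [items[i * size:(i + 1) * size] for i in range(num_chunks)]
```

Each slice returned by `chunk` can then be handed to one worker process, as the loop in the question does.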

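Have you tried the debugger, pdb? file.next raises StopIteration when the file reaches EOF. Are run_file1/2 being modified by another process?

Have you considered that your initial assumption is wrong, and that your program is proving that your files do not have equal line counts?

Run files 1 and 2 are not modified by any other process, but it is possible that the subprocess call to split leaves files open and causes the problem. I checked the files with wc -l run1* and so on, and I am 100% sure they are of equal length. So I will probably close the file handles explicitly after the split, which is part of the same script, and sidestep the problem.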
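
A `wc -l` run from the shell after the script has finished does not show what the workers saw while the script was running. A hedged way to test the "files not fully written yet" theory is to count lines inside the script itself, right before the worker processes are started. The following Python 3 sketch uses hypothetical helper names (`count_lines`, `find_unequal_groups`) and counts newline bytes, as `wc -l` does.

```python
def count_lines(path, block_size=1 << 20):
    """Count newline bytes in a file, reading it in fixed-size blocks."""
    count = 0
    with open(path, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                return count
            count += block.count(b"\n")


def find_unequal_groups(groups):
    """Return (group, line_counts) for every group whose files differ in length."""
    mismatched = []
    for group in groups:
        counts = [count_lines(path) for path in group]
        if len(set(counts)) > 1:
            mismatched.append((group, counts))
    return mismatched
```

If this reports mismatches at that point but `wc -l` shows equal lengths afterwards, the files were most likely still being written when the workers opened them. If split is launched with `subprocess.Popen`, calling `wait()` on it before starting the workers removes that race.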