
Python: spawning multiple processes to write different files


The idea is to use N processes to write N files.

The data for the files to be written comes from multiple input files, which are stored in a dictionary that has a list as its value, like this:

dic = {'file1':['data11.txt', 'data12.txt', ..., 'data1M.txt'],
       'file2':['data21.txt', 'data22.txt', ..., 'data2M.txt'], 
        ...
       'fileN':['dataN1.txt', 'dataN2.txt', ..., 'dataNM.txt']}
So file1 should be data11 + data12 + … + data1M, and similarly for the other files.

So my code looks like this:

jobs = []
for d in dic:
    outfile = str(d)+"_merged.txt"
    with open(outfile, 'w') as out:
        p = multiprocessing.Process(target = merger.merger, args=(dic[d], name, out))
        jobs.append(p)
        p.start()
        out.close()
and merge.py looks like this:

def merger(files, name, outfile):
    time.sleep(2)
    sys.stdout.write("Merging %n...\n" % name)

    # the reason for this step is that all the different files have a header
    # but I only need the header from the first file.
    with open(files[0], 'r') as infile:
        for line in infile:
            print "writing to outfile: ", name, line
            outfile.write(line) 
    for f in files[1:]:
        with open(f, 'r') as infile:
            next(infile) # skip first line
            for line in infile:
                outfile.write(line)
    sys.stdout.write("Done with: %s\n" % name)
I do see the file created in the folder where it should go, but it is empty. No header, nothing. I put prints inside to check whether everything was right, but nothing works.


Help!

Since the worker processes run in parallel with the main process that created them, the files named out get closed before the workers can write to them. This happens even if you remove out.close(), because of the with statement. Instead, pass the filename to each process and let the process open and close the file.
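
A minimal sketch of that approach, assuming the question's dic is in scope and keeping the same merge logic; the function and variable names here are illustrative, not the poster's exact code:

import multiprocessing
import sys

def merger(files, name, outfilename):
    sys.stdout.write("Merging %s...\n" % name)
    # The child opens its own output file; the with block closes it when done.
    with open(outfilename, 'w') as out:
        # Keep the header from the first input file only.
        with open(files[0], 'r') as infile:
            for line in infile:
                out.write(line)
        # Skip the header line of every remaining input file.
        for f in files[1:]:
            with open(f, 'r') as infile:
                next(infile)
                for line in infile:
                    out.write(line)
    sys.stdout.write("Done with: %s\n" % name)

jobs = []
for d in dic:
    # Pass only the output *filename*; the child opens and closes the file itself.
    p = multiprocessing.Process(target=merger, args=(dic[d], d, str(d) + "_merged.txt"))
    jobs.append(p)
    p.start()
for p in jobs:
    p.join()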

The problem is that you are not closing the file in the child process, so the internally buffered data is lost. You could either move the open into the child, or wrap the whole thing in a try/finally block to make sure the file gets closed. A potential advantage of opening the file in the parent is that you can handle file errors there. I'm not saying it's compelling, just that it's an option.

def merger(files, name, outfile):
    try:
        time.sleep(2)
        sys.stdout.write("Merging %n...\n" % name)

        # the reason for this step is that all the different files have a header
        # but I only need the header from the first file.
        with open(files[0], 'r') as infile:
            for line in infile:
                print "writing to outfile: ", name, line
                outfile.write(line) 
        for f in files[1:]:
            with open(f, 'r') as infile:
                next(infile) # skip first line
                for line in infile:
                    outfile.write(line)
        sys.stdout.write("Done with: %s\n" % name)
    finally:
        outfile.close()
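
For completeness, a sketch of how the parent side might drive this try/finally variant. It assumes a Unix fork start method (as in the example below), where closing the parent's copy of the descriptor does not affect the child's copy, that the question's dic is in scope, and that the function above lives in a hypothetical merger module:

import multiprocessing

import merger  # hypothetical module containing the try/finally merger above

jobs = []
for d in dic:
    # Open in the parent (no with block), so file errors can be handled here.
    out = open(str(d) + "_merged.txt", 'w')
    p = multiprocessing.Process(target=merger.merger, args=(dic[d], d, out))
    jobs.append(p)
    p.start()
    # Closes only the parent's copy; the forked child still holds its own descriptor.
    out.close()
for p in jobs:
    p.join()
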
Update

There was some confusion about parent/child file descriptors and what happens to the file in the child. If the file is still open when the program exits, the underlying C library does not flush the data to disk. The theory is that a properly running program closes things before it exits. Here is an example where the child loses data because it does not close the file:

import multiprocessing as mp
import os
import time

if os.path.exists('mytestfile.txt'):
    os.remove('mytestfile.txt')

def worker(f, do_close=False):
    time.sleep(2)
    print('writing')
    f.write("this is data")
    if do_close:
        print("closing")
        f.close()


print('without close')
f = open('mytestfile.txt', 'w')
p = mp.Process(target=worker, args=(f, False))
p.start()
f.close()
p.join()
print('file data:', open('mytestfile.txt').read())

print('with close')
os.remove('mytestfile.txt')
f = open('mytestfile.txt', 'w')
p = mp.Process(target=worker, args=(f, True))
p.start()
f.close()
p.join()
print('file data:', open('mytestfile.txt').read())
I ran it on linux and got:

without close
writing
file data: 
with close
writing
closing
file data: this is data

out.close() is called right after p.start(). I doubt the merge task gets time to run before the file is closed from under it.

@Blorgbeard good point, but still nothing...

This is on a linux-like OS, right?

@Blorgbeard closing a read-only file in the parent won't affect the file in the child. It would be an issue if there were written data to flush, but that's not the case here.

@tdelaney please note that the file opened/closed in the parent is a write-access file. I am talking about open(outfile, 'w') and out.close().

@Pavlos no, keep the same number of processes, but pass only the filename instead of the file object. Closing the file in the parent shouldn't be a problem for the child, though.

I don't see how that solved it @tdelaney, since the parent closes the file before the child gets a chance to write to it, and once a file is closed you cannot write to it.

No, that's not how it works. The child is spawned with an independent copy of the file descriptor. The parent can close its copy, but that has no effect on the child. What is really going on here is that the OP did not close the file in the child, so its unwritten data was discarded. When the OP changed to opening the file in the child, he also changed to closing it in the child. That is what actually fixed the problem.

@tdelaney I think you are right, I forgot that the processes get separate copies of the descriptor. Here is what I get on Windows (python 2 and 3): -

tl;dr: an error. That is not unexpected. Windows tries to reopen the file, but it was not opened for sharing. True... quite different.