Python: read multiple large files in parallel and yield their lines individually

python file-io


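I have multiple large files and need to yield their lines one at a time, cycling through the files in round-robin fashion. Something like the following pseudocode: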

    def get(self):
        with open(file_list, "r") as files:
            for file in files:
                yield file.readline()

How can I do this?

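Using open() as a context manager here would be tricky (or would require an additional library), but it shouldn't be difficult without a context manager: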
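def get(files_list):
  file_handles = [open(f, 'r') for f in files_list]
  try:
    while file_handles:
      # Iterate over a copy so exhausted handles can be removed safely.
      for fd in list(file_handles):
        line = fd.readline()
        if line:
          yield line
        else:
          # EOF: close this handle and drop it from the rotation.
          fd.close()
          file_handles.remove(fd)
  finally:
    # Close whatever is still open if the generator is abandoned early.
    for fd in file_handles:
      fd.close()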

I'm assuming you want to keep going until every line has been read from every file, with the shorter files dropping out as they reach EOF.
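
To make the drop-out behaviour concrete, here is a minimal sketch that uses in-memory io.StringIO objects as stand-ins for real files. The name interleave_lines and the sample contents are assumptions made for illustration only; they are not part of the answer above.

```python
import io

def interleave_lines(handles):
    """Yield one line per handle in turn; drop a handle once it hits EOF."""
    active = list(handles)
    while active:
        # Iterate over a snapshot so removing from `active` is safe.
        for fh in list(active):
            line = fh.readline()
            if line:
                yield line
            else:
                active.remove(fh)

# Stand-ins for three files of different lengths (hypothetical contents).
short = io.StringIO("a1\n")
medium = io.StringIO("b1\nb2\n")
long_ = io.StringIO("c1\nc2\nc3\n")

result = [line.strip() for line in interleave_lines([short, medium, long_])]
print(result)
```

The shortest stand-in contributes only to the first round, and the longest one keeps yielding after the others are exhausted.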
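The itertools documentation has several recipes, among them a very neat round-robin recipe. I would also use ExitStack to manage multiple file context managers: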

from itertools import cycle, islice
from contextlib import ExitStack

# https://docs.python.org/3.8/library/itertools.html#itertools-recipes
def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Recipe credited to George Sakkis
    num_active = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while num_active:
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            # Remove the iterator we just exhausted from the cycle.
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))

...

def get(self):
    with open(files_list) as fl:
        filenames = [x.strip() for x in fl]
    with ExitStack() as stack:
        files = [stack.enter_context(open(fname)) for fname in filenames]
        yield from roundrobin(*files)
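
As a small illustration of what ExitStack contributes here, the sketch below opens two temporary files through a stack and checks that both are closed once the with block ends. The file names and contents are hypothetical, chosen only for the demonstration.

```python
import os
import tempfile
from contextlib import ExitStack

with tempfile.TemporaryDirectory() as tmp:
    # Create two small sample files (hypothetical names and contents).
    paths = []
    for name, text in [("a.txt", "a1\na2\n"), ("b.txt", "b1\n")]:
        path = os.path.join(tmp, name)
        with open(path, "w") as f:
            f.write(text)
        paths.append(path)

    with ExitStack() as stack:
        # Every file registered on the stack is closed when the block exits.
        files = [stack.enter_context(open(p)) for p in paths]
        first_lines = [f.readline().strip() for f in files]
        open_inside = [not f.closed for f in files]

    closed_after = [f.closed for f in files]

print(first_lines, open_inside, closed_after)
```

Note that when the stack lives inside a generator, as in the get method above, the files stay open until the generator is exhausted or closed.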

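That said, perhaps the best design is to use inversion of control and pass the sequence of file objects as an argument to .get, so that the calling code takes care of using the exit stack: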

class Foo:
    ...
    def get(self, files):
        yield from roundrobin(*files)

# calling code:
foo = Foo() # or however it is initialized

with open(files_list) as fl:
    filenames = [x.strip() for x in fl]
with ExitStack() as stack:
    files = [stack.enter_context(open(fname)) for fname in filenames]
    for line in foo.get(files):
        do_something_with_line(line)
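
One practical consequence of this design is that get no longer cares where the file objects come from, so it can be exercised without touching the filesystem. The sketch below assumes io.StringIO stand-ins with made-up contents, and repeats the roundrobin recipe only so that the example runs on its own.

```python
import io
from itertools import cycle, islice

def roundrobin(*iterables):
    # Same itertools recipe as quoted above, repeated for self-containment.
    num_active = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while num_active:
        try:
            for nxt in nexts:
                yield nxt()
        except StopIteration:
            # Remove the iterator we just exhausted from the cycle.
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))

class Foo:
    def get(self, files):
        yield from roundrobin(*files)

# In-memory stand-ins for open files (hypothetical contents).
fake_files = [
    io.StringIO("a1\na2\na3\n"),
    io.StringIO("b1\n"),
    io.StringIO("c1\nc2\n"),
]
lines = [line.strip() for line in Foo().get(fake_files)]
print(lines)
```

The caller decides how the objects are opened and closed; get only interleaves whatever iterables it is given.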

Found a possible duplicate:

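file_list is a Python list of file-name strings.

open doesn't accept a list as input, does it? It takes str, bytes, or os.path-like objects. Am I missing something?

No, you're right. However, my code is only pseudocode; I assumed that such an "open" command exists.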
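
The point about open can be checked directly. This short sketch passes a list of hypothetical file names to the built-in open and records the resulting exception; the files do not need to exist, because the argument type is rejected before any file is looked up.

```python
names = ["a.txt", "b.txt"]  # hypothetical file names; the files need not exist

try:
    open(names, "r")
except TypeError as exc:
    # open() rejects a list: it expects a single path (or file descriptor).
    error = exc
else:
    error = None

print(type(error).__name__)
```

Each name therefore has to be opened separately, for example with a list comprehension as in the answers above.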