Processing large lists in Python more efficiently (memory-wise)


python performance list


I am trying to read large text files (~10GB) and put each string into a list.

import itertools

# files, pool, addline and function are defined elsewhere
corpus = []
for file in files:
    with open(file) as source:
        # Use multiprocessing to read all lines and add them to the list
        filewords = pool.map(addline, source)

        # Concatenate each sublist in filewords into one list of all words
        filewords = list(itertools.chain(*filewords))

    corpus.append(filewords)

# do something with the list
function(corpus)

What should I do to make this more memory-efficient?

Maybe generators? (I have no experience with them.)

In this case I would not necessarily use multiprocessing at all. 10GB is not that much, and you can easily do something as simple as this:
for file in files:
    with open(file) as source:
        for line in source:
            pass  # process the line here
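
If the per-line work is splitting lines into words, the nested lists that the question builds with pool.map and itertools.chain can be replaced by a generator that yields one word at a time. A minimal sketch, assuming whitespace-separated words in UTF-8 files; `iter_words` and `count_words` are hypothetical names, not taken from the question:

```python
from collections import Counter


def iter_words(paths):
    """Yield words one at a time from each file in paths, lazily."""
    for path in paths:
        # Assumption: the files are UTF-8 text
        with open(path, encoding="utf-8") as source:
            for line in source:
                # str.split() with no arguments splits on any whitespace
                yield from line.split()


def count_words(paths):
    """Consume the word stream without building a list of all words."""
    return Counter(iter_words(paths))
```

Only one line is held in memory at a time, plus whatever the consumer keeps (here, one counter entry per distinct word).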

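If you want to use a cluster, do not use multiprocessing; use the cluster's API instead.
As Antti Happala suggested, check whether his suggestion is a usable solution.
If not, you can probably use generators, but it really depends on what you are doing with the ~10GB of text files. If you go down the generator route, I recommend creating a class and overriding the __iter__ method. That way, if you have to iterate over the file more than once, you always get a generator that starts at the beginning of the file.
This matters if you pass generators between functions.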

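A generator produced by a function returns a reference to the same generator every time you iterate over it.
Overriding __iter__ returns a new generator every time.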

Function generator:
def iterfile(my_file):
    with open(my_file) as the_file:
        for line in the_file:
            yield line

__iter__ generator:
class IterFile(object):

    def __init__(self, my_file):
        self.my_file = my_file

    def __iter__(self):
        with open(self.my_file) as the_file:
            for line in the_file:
                yield line

Difference in behavior:
>>> func_gen = iterfile('/tmp/junk.txt')
>>> iter(func_gen) is iter(func_gen)
True

>>> iter_gen = IterFile('/tmp/junk.txt')
>>> iter(iter_gen) is iter(iter_gen)
False

>>> list(func_gen)
['the only line in the file\n']
>>> list(func_gen)
[]

>>> list(iter_gen)
['the only line in the file\n']
>>> list(iter_gen)
['the only line in the file\n']
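
To illustrate why this matters when a consumer iterates more than once, here is a small sketch. `longest_lines` is a hypothetical helper that makes two passes over its argument, and the class repeats the `__iter__` pattern so the block is self-contained:

```python
class IterFile:
    """Re-iterable wrapper: every iter() call reopens the file."""

    def __init__(self, my_file):
        self.my_file = my_file

    def __iter__(self):
        with open(self.my_file) as the_file:
            for line in the_file:
                yield line


def longest_lines(lines):
    """Return all lines of maximal length, using two passes over lines."""
    # First pass: find the maximum line length (0 if there are no lines)
    max_len = max((len(line) for line in lines), default=0)
    # Second pass: only sees data if `lines` can be iterated again
    return [line for line in lines if len(line) == max_len]
```

Passing an `IterFile` instance works because the second pass starts from the beginning of the file again. Passing a plain generator returns an empty list, because the first pass has already exhausted it.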

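Why use multiprocessing? It feels like that would only make things slower?

If you can work with ASCII/UTF-8 encoded strings, you will find my suggestion useful.

"Use multiprocessing to read all lines and add them to the list" - why?

I have access to a cluster and thought multiprocessing could make this process faster.

The multiprocessing module does nothing for clusters. If you want to take advantage of your cluster, you need to use a different technique, which probably depends on how your cluster is set up.

Sorry, I meant that I have access to an SMP system, not a cluster.

mmap needs a contiguous block of memory in the main process's address space that is large enough for the whole file object, so it may not be feasible for large files.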
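
For reference, iterating over a memory-mapped file with the standard-library mmap module could look roughly like the sketch below. `iter_mmap_lines` is a hypothetical name, and whether the mapping succeeds for a given file depends on the address space available to the process, as the last comment points out:

```python
import mmap


def iter_mmap_lines(path):
    """Yield lines (as bytes) from a memory-mapped file."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mmap raises ValueError for an empty file
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # readline() returns b"" at end of file, which stops iter()
            yield from iter(mm.readline, b"")
```

The lines come back as bytes, so they need to be decoded if text is required.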