What is the fastest way to process a large file in Python?

python file python-2.7


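I have multiple 3 GB tab-delimited files. There are 20 million rows in each file. All the rows have to be processed independently; there is no relation between any two rows. My question is, which will be faster? A. Reading line by line using: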

with open() as infile:
    for line in infile:

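Or B. Reading the file into memory in chunks and processing it, say 250 MB at a time?

The processing is not very complicated: I am just grabbing the value in column 1 into List1, column 2 into List2, and so on. I might need to add some column values together.

I am using Python 2.7 on a Linux box that has 30 GB of memory. The files are ASCII text.

Is there any way to speed things up in parallel? Right now I am using the former method and the process is very slow. Would using any CSVReader module help?

I don't have to do it in Python; ideas using any other language or a database are welcome.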

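It sounds like your code is I/O bound. This means that multiprocessing isn't going to help: if you spend 90% of your time reading from disk, having an extra 7 processes waiting on the next read isn't going to help anything.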

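And, while using a CSV reading module (whether the stdlib's csv or something like NumPy or Pandas) may be a good idea for simplicity, it's unlikely to make much difference in performance.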
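To make the simplicity argument concrete, here is a minimal sketch of pulling tab-separated columns into per-column lists with the stdlib csv module. The function name `read_columns` and its parameters are made up for illustration, and the sketch is written for Python 3; on Python 2.7 the file would be opened in "rb" mode instead of with `newline=""`.

```python
import csv


def read_columns(path, n_cols):
    """Collect the first n_cols tab-separated fields of every row into per-column lists.

    Hypothetical helper for illustration; not part of the original answer.
    """
    columns = [[] for _ in range(n_cols)]
    with open(path, newline="") as infile:
        # QUOTE_NONE treats quote characters as ordinary data, as in plain TSV
        reader = csv.reader(infile, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # zip stops at the shorter of the two, so short rows are tolerated
            for col, value in zip(columns, row):
                col.append(value)
    return columns
```

Calling `read_columns(path, 2)` would return two lists, one holding every column-1 value and one holding every column-2 value. Values stay strings, so adding columns together would need an explicit `int()` or `float()` conversion.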

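Still, it's worth checking that you really are I/O bound instead of just guessing. Run your program and see whether your CPU usage is close to 0%, or close to 100% of one core. Do what Amadan suggested in a comment: run your program with just pass for the processing and see whether that cuts off 5% of the time or 70%. You may even want to compare against a loop over os.open and os.read(1024*1024) or something similar and see if that's any faster.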
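The comparison described above could be sketched roughly as follows. The helper names `time_line_iteration` and `time_raw_reads` are invented for this example, and the 1 MB block size is just the figure mentioned in the text.

```python
import os
import time


def time_line_iteration(path):
    """Time a bare line-by-line pass over the file with no real processing."""
    start = time.time()
    n_lines = 0
    with open(path, "rb") as infile:
        for line in infile:
            n_lines += 1  # stand-in for `pass`; counting only sanity-checks the run
    return time.time() - start, n_lines


def time_raw_reads(path, block_size=1024 * 1024):
    """Time unbuffered reads of block_size bytes via os.open / os.read."""
    fd = os.open(path, os.O_RDONLY)
    start = time.time()
    n_bytes = 0
    try:
        while True:
            block = os.read(fd, block_size)
            if not block:  # empty result means end of file
                break
            n_bytes += len(block)
    finally:
        os.close(fd)
    return time.time() - start, n_bytes
```

If the bare line loop takes nearly as long as the full program, the per-line processing is not the bottleneck. Keep in mind that a second pass over the same file is often served from the OS page cache, so timings are only comparable when both runs start from the same cache state.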

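Since you're using Python 2.x, Python is relying on the C stdio library to guess how much to buffer at a time, so it might be worth forcing it to buffer more. The simplest way to do that is to use readlines(bufsize) for some large bufsize. (You can try different numbers and measure them to see where the peak is. In my experience, usually anything from 64K to 8MB is about the same, but depending on your system that may be different, especially if you're, e.g., reading off a network filesystem with great throughput but horrible latency that swamps the throughput-vs.-latency of the actual physical drive and the caching the OS does.)

So, for example: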

bufsize = 65536
with open(path) as infile:
    while True:
        # readlines(sizehint) returns whole lines totalling roughly bufsize bytes
        lines = infile.readlines(bufsize)
        if not lines:
            break
        for line in lines:
            process(line)

Meanwhile, assuming you're on a 64-bit system, you may want to try using mmap instead of reading the file in the first place. This certainly isn't guaranteed to be better, but it may be, depending on your system. For example:

import mmap

with open(path) as infile:
    # mmap.mmap takes a file descriptor, not a file object
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)

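A Python mmap is sort of a weird object: it acts like a str and like a file at the same time, so you can, e.g., manually iterate scanning for newlines, or you can call readline on it as if it were a file. Both of those will take more processing from Python than iterating the file as lines or doing batch readlines (because a loop that would be in C is now in pure Python, although maybe you can get around that with re, or with a simple Cython extension?). But the I/O advantage of the OS knowing what you're doing with the mapping may swamp the CPU disadvantage.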
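The two ways of walking an mmap mentioned above might look like this. This is a sketch in Python 3 (where mmap objects work as context managers and yield bytes); the generator names are made up for the example.

```python
import mmap


def iter_lines_readline(path):
    """Yield lines by calling readline on the mapping, as if it were a file."""
    with open(path, "rb") as infile:
        # Note: mapping an empty file raises ValueError
        with mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for line in iter(m.readline, b""):  # readline returns b"" at EOF
                yield line


def iter_lines_find(path):
    """Yield lines by scanning the mapping for newlines manually."""
    with open(path, "rb") as infile:
        with mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ) as m:
            pos, size = 0, len(m)
            while pos < size:
                end = m.find(b"\n", pos)
                # no newline left: the final line runs to the end of the file
                end = size if end == -1 else end + 1
                yield m[pos:end]  # slicing copies the bytes out of the mapping
                pos = end
```

Both generators yield the same byte strings, newline included. Which one is faster, and whether either beats plain file iteration, is something to measure on the actual system.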

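Unfortunately, Python doesn't expose the madvise call that you'd use to tweak things in an attempt to optimize this in C (e.g., explicitly setting MADV_SEQUENTIAL instead of making the kernel guess, or forcing transparent huge pages), but you can actually ctypes the function out of libc.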
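For readers on newer interpreters: Python 3.8 and later (not the 2.7 used in the question) do give mmap objects a madvise method, on systems that support the call. The sketch below uses that method rather than the ctypes route described above; the function name `count_newlines_sequential` is made up, and the hint is skipped when the method or constant is unavailable.

```python
import mmap


def count_newlines_sequential(path):
    """Count newlines in a mapped file after hinting sequential access.

    Assumes Python 3; the madvise hint is applied only where it exists.
    """
    with open(path, "rb") as infile:
        with mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ) as m:
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                # Tell the kernel the pages will be read front to back
                m.madvise(mmap.MADV_SEQUENTIAL)
            count = 0
            pos = m.find(b"\n")
            while pos != -1:
                count += 1
                pos = m.find(b"\n", pos + 1)
            return count
```

The hint is advisory only, so any benefit depends on the kernel and the storage underneath and should be measured rather than assumed.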

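I know this question is old, but I wanted to do a similar thing, so I created a simple framework that helps you read and process a large file in parallel. I'm leaving what I tried as an answer.

This is the code; I give an example at the end.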

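from __future__ import print_function  # print(x, file=fout) below needs this on Python 2

import gc
import multiprocessing as mp
import os
import time

import psutil  # third-party package, used only for cpu_count()


def chunkify_file(fname, size=1024*1024*1000, skiplines=-1):
    """
    Divide a large text file into line-aligned chunks of roughly `size` bytes each.

    Params :
        fname : path to the file to be chunked
        size : approximate size of each chunk in bytes (chunks run slightly over, to the next line end)
        skiplines : number of lines at the beginning to skip; -1 means don't skip any lines
    Returns :
        list of (start offset in bytes, chunk length in bytes, fname) tuples
    """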
    chunks = []
    fileEnd = os.path.getsize(fname)
    with open(fname, "rb") as f:
        if(skiplines > 0):
            for i in range(skiplines):
                f.readline()

        chunkEnd = f.tell()
        count = 0
        while True:
            chunkStart = chunkEnd
            f.seek(f.tell() + size, os.SEEK_SET)
            f.readline()  # make this chunk line aligned
            chunkEnd = f.tell()
            chunks.append((chunkStart, chunkEnd - chunkStart, fname))
            count+=1

            # >= (not >) so a chunk ending exactly at EOF is not followed by an empty chunk
            if chunkEnd >= fileEnd:
                break
    return chunks

def parallel_apply_line_by_line_chunk(chunk_data):
    """
    function to apply a function to each line in a chunk

    Params :
        chunk_data : the data for this chunk 
    Returns :
        list of the non-None results for this chunk
    """
    chunk_start, chunk_size, file_path, func_apply = chunk_data[:4]
    func_args = chunk_data[4:]

    chunk_res = []
    with open(file_path, "rb") as f:
        f.seek(chunk_start)
        # chunks are line aligned, so decoding never splits a multi-byte character
        cont = f.read(chunk_size).decode(encoding='utf-8')
        lines = cont.splitlines()

        for line in lines:
            ret = func_apply(line, *func_args)
            if ret is not None:
                chunk_res.append(ret)
    return chunk_res

def parallel_apply_line_by_line(input_file_path, chunk_size_factor, num_procs, skiplines, func_apply, func_args, fout=None):
    """
    function to apply a supplied function line by line in parallel

    Params :
        input_file_path : path to input file
        chunk_size_factor : size of 1 chunk in MB
        num_procs : number of parallel processes to spawn, max used is num of available cores - 1
        skiplines : number of top lines to skip while processing
        func_apply : a function which expects a line and outputs None for lines we don't want processed
        func_args : arguments to function func_apply
        fout : do we want to output the processed lines to a file
    Returns :
        list of the non-None results obtained by processing each line
    """
    num_parallel = max(1, min(num_procs, psutil.cpu_count()) - 1)  # keep at least one worker

    jobs = chunkify_file(input_file_path, 1024 * 1024 * chunk_size_factor, skiplines)

    jobs = [list(x) + [func_apply] + func_args for x in jobs]

    print("Starting the parallel pool for {} jobs ".format(len(jobs)))

    lines_counter = 0

    pool = mp.Pool(num_parallel, maxtasksperchild=1000)  # without maxtasksperchild, worker processes lingered and memory use blew up

    outputs = []
    for i in range(0, len(jobs), num_parallel):
        print("Chunk start = ", i)
        t1 = time.time()
        chunk_outputs = pool.map(parallel_apply_line_by_line_chunk, jobs[i : i + num_parallel])

        for subl in chunk_outputs:
            for x in subl:
                if fout is not None:
                    print(x, file=fout)
                else:
                    outputs.append(x)
                lines_counter += 1
        del(chunk_outputs)
        gc.collect()
        print("All Done in time ", time.time() - t1)

    print("Total lines we have = {}".format(lines_counter))

    pool.close()
    pool.terminate()
    return outputs

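Say, for example, I have a file in which I want to count the number of words in each line; the processing of each line would then look like this: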

def count_words_line(line):
    return len(line.strip().split())

and then call the function like this:

parallel_apply_line_by_line(input_file_path, 100, 8, 0, count_words_line, [], fout=None)

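Using this, I get a speedup of about 8x compared to vanilla line-by-line reading on a sample file of about 20 GB, on which I do some moderately complicated processing for each line.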

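Is your code I/O-bound, or CPU-bound? In other words, does the processing take more time than the reading? If so, you can probably speed it up with multiprocessing; if not, your background processes are just going to spend all their time waiting on the next read and you'll get no benefit.

Meanwhile, for line in infile: already does decent buffering inside the io module code (in Python 3.1+) or inside the C stdio underneath (in Python 2.x), so unless you're using Python 3.0, it should be fine. But if you want to force it to use larger buffers, you can always loop over, say, infile.readlines(65536) and then loop over the lines within each chunk.

Also, it probably matters whether this is 2.x or 3.x, which 3.x version if 3.x, what platform you're on, and whether this is ASCII text or something that really needs to be decoded, so please add that information.

@abarnert "decent" at best. If he/she has plenty of memory and doesn't care about the 3 GB hit, he/she could do for line in infile.readlines(): which will be faster to iterate over than the file object itself.

@Vincenzzzochi Actually, I've personally had a lot of experience processing "Big Data" with Python, and it fares quite well if you design your solutions correctly; again, it depends on the nature of your problem: CPU bound vs. I/O bound, or a bit of both. Python isn't really that slow :)

I have 30 GB of memory on the Linux machine. Is there any issue with doing a readlines() to take the entire file into memory?

@Reise45: It depends on what you mean by "issue". It should work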
