Python: fastest way to "grep" big files

python python-2.7

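I have very large log files (from 100MB to 2GB) that contain one (single) specific line I need to parse in a Python program. I have to parse about 20,000 files. I know that the line I am looking for is within the last 200 lines of the file, or within the last 15000 bytes.

Since this is a recurring task, I need it to be as fast as possible. What is the fastest way to get it?

I have considered 4 strategies:

Read the whole file in Python and search with a regex (method_1)
Read only the last 15000 bytes of the file and search with a regex (method_2)
Make a system call to grep (method_3)
Make a system call to grep after tailing the last 200 lines (method_4)

Here are the functions I wrote to test these strategies: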

import os
import re
import subprocess

def method_1(filename):
    """Method 1: read whole file and regex"""
    regex = r'\(TEMPS CP :[ ]*.*S\)'
    with open(filename, 'r') as f:
        txt = f.read()
    match = re.search(regex, txt)
    if match:
        print match.group()

def method_2(filename):
    """Method 2: read part of the file and regex"""
    regex = r'\(TEMPS CP :[ ]*.*S\)'
    with open(filename, 'r') as f:
        size = min(15000, os.stat(filename).st_size)
        f.seek(-size, os.SEEK_END)
        txt = f.read(size)
        match = re.search(regex, txt)
        if match:
            print match.group()

def method_3(filename):
    """Method 3: grep the entire file"""
    cmd = 'grep "(TEMPS CP :" {} | head -n 1'.format(filename)
    process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
    print process.communicate()[0][:-1]

def method_4(filename):
    """Method 4: tail of the file and grep"""
    cmd = 'tail -n 200 {} | grep "(TEMPS CP :"'.format(filename)
    process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
    print process.communicate()[0][:-1]

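I ran these methods on two files ("trace" is 207MB and "trace_big" is 1.9GB) and got the following computation times (in seconds):

So method 2 seems to be the fastest. But is there any other solution I did not think of?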
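
For readers on Python 3, method 2 needs one adjustment: text-mode files there do not allow a nonzero seek relative to the end of the file, so the file has to be opened in binary mode and searched with a bytes pattern. A minimal sketch under that assumption; the name tail_search and the tail_bytes parameter are illustrative and not part of the original code:

```python
import os
import re

# Bytes pattern, because the file is read in binary mode.
PATTERN = re.compile(rb'\(TEMPS CP :[ ]*.*S\)')

def tail_search(filename, tail_bytes=15000):
    """Search only the last `tail_bytes` bytes of the file (hypothetical helper)."""
    with open(filename, 'rb') as f:
        size = os.fstat(f.fileno()).st_size
        # Jump straight to the tail; never seek before the start of the file.
        f.seek(max(0, size - tail_bytes))
        match = PATTERN.search(f.read())
    return match.group().decode('ascii', 'replace') if match else None
```

Returning the match instead of printing it makes the helper easier to call from a larger program.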

Edit: in addition to the previous methods, Gosha F suggested a fifth method using mmap:

import contextlib
import math
import mmap

def method_5(filename):
    """Method 5: use memory mapping and regex"""
    regex = re.compile(r'\(TEMPS CP :[ ]*.*S\)')
    offset = max(0, os.stat(filename).st_size - 15000)
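    ag = mmap.ALLOCATIONGRANULARITY
    # mmap requires the offset to be a multiple of ALLOCATIONGRANULARITY.
    # Note: in Python 2, offset/ag is integer division on ints, so this
    # expression rounds the offset down even though it calls math.ceil.
    offset = ag * (int(math.ceil(offset/ag)))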
    with open(filename, 'r') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY, offset=offset)
        with contextlib.closing(mm) as txt:
            match = regex.search(txt)
            if match:
                print match.group()

I tested it and got the following results:

+----------+-----------+-----------+
|          |   trace   | trace_big |
+----------+-----------+-----------+
| method_5 | 2.50E-004 | 2.71E-004 |
+----------+-----------+-----------+

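It would probably be faster to do the processing in the shell, to avoid the Python overhead. You can then pipe the result into a Python script. Otherwise it looks like you have already done the fastest thing.

Seeking and then running the regex match should be very fast. Methods 2 and 4 do the same thing, but method 4 incurs the extra overhead of Python making a system call.

Does it have to be Python? Why not a shell script?

My guess is that method 4 will be the fastest and most efficient. That is certainly how I would write it as a shell script, and it has to be faster than 1 or 3. I would still time it against method 2 to be 100% sure.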
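
Part of the overhead discussed here comes from shell=True, which starts an extra shell process for every file. Below is a sketch of method 4 that connects tail and grep directly, assuming both are on the PATH of a Unix-like system. It uses Python 3 syntax, and tail_grep is an illustrative name:

```python
import subprocess

def tail_grep(filename, needle='(TEMPS CP :', lines=200):
    """Run `tail -n LINES file | grep -F -e needle` without an intermediate shell."""
    tail = subprocess.Popen(['tail', '-n', str(lines), filename],
                            stdout=subprocess.PIPE)
    # -F treats the needle as a fixed string, so '(' needs no escaping.
    grep = subprocess.Popen(['grep', '-F', '-e', needle],
                            stdin=tail.stdout, stdout=subprocess.PIPE)
    tail.stdout.close()  # let tail receive SIGPIPE if grep exits first
    out, _ = grep.communicate()
    tail.wait()
    return out.decode('ascii', 'replace').rstrip('\n') or None
```

This still spawns two processes per file, so it is unlikely to beat reading the tail of the file directly in Python; it only removes the cost of the shell itself.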

You may also consider using memory mapping (the mmap module), like this:

def method_5(filename):
    """Method 5: use memory mapping and regex"""
    regex = re.compile(r'\(TEMPS CP :[ ]*.*S\)')
    offset = max(0, os.stat(filename).st_size - 15000)
    with open(filename, 'r') as f:
        with contextlib.closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY, offset=offset)) as txt:
            match = regex.search(txt)
            if match:
                print match.group()

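A few side notes:

When using a shell command, an alternative search tool may in some cases be orders of magnitude faster than grep (although with only 200 lines of greppable text, the difference probably vanishes next to the overhead of starting a shell).
Just compiling the regex at the beginning of the function may make some difference.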
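
The precompiled-regex note combines naturally with the memory-mapping idea. The sketch below compiles the pattern once at module level and maps only the tail of the file. mmap requires the offset to be a multiple of mmap.ALLOCATIONGRANULARITY, so the offset is rounded down here, which keeps the whole tail inside the mapping. This is Python 3 syntax, and mmap_tail_search is an illustrative name rather than code from the answer:

```python
import mmap
import os
import re

# Compiled once at import time rather than on every call.
PATTERN = re.compile(rb'\(TEMPS CP :[ ]*.*S\)')

def mmap_tail_search(filename, tail_bytes=15000):
    """Memory-map roughly the last `tail_bytes` bytes and search them."""
    size = os.stat(filename).st_size
    if size == 0:
        return None  # mmap cannot map an empty file
    ag = mmap.ALLOCATIONGRANULARITY
    # Round down to an aligned offset so the whole tail stays in the mapping.
    offset = (max(0, size - tail_bytes) // ag) * ag
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ, offset=offset) as mm:
            match = PATTERN.search(mm)
            return match.group().decode('ascii', 'replace') if match else None
```

Rounding down maps at most one extra allocation unit before the tail, which costs little and never drops bytes that a rounded-up offset would skip.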

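I need to get this value inside a Python program, because it is just a small part of a management code, written in Python, for a computation code.

In that case I would still consider writing it as a shell script. Depending on the performance, you could fall back to the Python implementation. For example, you could have a shell script that finds the line and writes the value to another file, then have Python call the script and read the value from the file.

Would running an external script and getting its return value be faster than running the command with subprocess?

Method 4 is the fastest in shell, because method 2 cannot be implemented in shell. Method 2 is the fastest in Python, because it does not involve spawning subprocesses.

I just ran the test: the results are the same as those of method 4.

You might want to use a proper timing tool to measure code snippets like this.

In fact, my tests were run with:
python -m timeit 'from fast_read import method_1 as method; method("trace")'

I wonder how these (5) would fare without the restriction of reading only the last 15000 bytes or 200 lines, i.e. if you were using them for a "typical grep" scenario and searching through files of both sizes.

Great solution: it improves the computation time compared to method 2. Just a note: in mmap.mmap, the offset must be a multiple of mmap.ALLOCATIONGRANULARITY, so I rounded it up to its nearest upper multiple.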
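
The timing approach from the comments can also be scripted from within Python. Below is a sketch using the standard timeit module; time_call is a hypothetical helper, and any of the method functions above could be passed as the callable:

```python
import timeit

def time_call(func, *args, repeat=5, number=10):
    """Return the best average time per call, in seconds."""
    timer = timeit.Timer(lambda: func(*args))
    # Taking the minimum over several runs reduces noise from other processes.
    return min(timer.repeat(repeat=repeat, number=number)) / number
```

For example, time_call(method_2, 'trace') would time method 2 on the smaller file. Repeated runs on the same file mostly read from the operating system's file cache, so the numbers may not reflect a first, cold read.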
