Python text file processing speed problem

python perl file-io

I'm having a problem processing a large file in Python. All I'm doing is

import gzip

counter = 0
f = gzip.open(pathToLog, 'r')
for line in f:
        counter = counter + 1
        if (counter % 1000000 == 0):
                print counter
f.close()

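Just opening the file, reading the lines and incrementing this counter takes about 10m25s.

In Perl, processing the same file and doing quite a bit more work (some regular expression stuff), the whole process takes around 1m17s.

Perl code: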

open(LOG, "/bin/zcat $logfile |") or die "Cannot read $logfile: $!\n";
while (<LOG>) {
        if (m/.*\[svc-\w+\].*login result: Successful\.$/) {
                $_ =~ s/some regex here/$1,$2,$3,$4/;
                push @an_array, $_
        }
}
close LOG;

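Can anyone advise what I can do to make the Python solution run at a speed similar to the Perl solution?

EDIT: I've tried uncompressing the file first and handling it with open instead of gzip.open, but that only changes the total time to around 4m14.972s, which is still too slow.

I also removed the modulo and print statements and replaced them with pass, so all the loop does now is step through the file.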

In Python (at least ...

Your computer took 10 minutes? It must be your hardware. I wrote this function to write 5 million lines:

def write():
    fout = open('log.txt', 'w')
    for i in range(5000000):
        fout.write(str(i/3.0) + "\n")
    fout.close()

Then I read it back with a program much like yours:

def read():
    fin = open('log.txt', 'r')
    counter = 0
    for line in fin:
        counter += 1
        if counter % 1000000 == 0:
            print counter
    fin.close()

It took my computer about 3 seconds to read all 5 million lines.
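For anyone repeating this measurement on a current interpreter, here is a minimal Python 3 sketch of the same write-then-read experiment. The function names, the temporary file and the use of time.perf_counter are illustrative assumptions, not part of the original test, and the timings will of course depend on the machine.

```python
import os
import tempfile
import time


def write_lines(path, n_lines):
    """Write n_lines short text lines, mirroring the write() function above."""
    with open(path, "w") as fout:
        for i in range(n_lines):
            fout.write(str(i / 3.0) + "\n")


def count_lines(path):
    """Iterate over the file line by line and return the number of lines."""
    counter = 0
    with open(path, "r") as fin:
        for _ in fin:
            counter += 1
    return counter


def run_experiment(n_lines):
    """Write n_lines to a temporary file, then time one read pass over it.

    Returns (line_count, elapsed_seconds); the temporary file is removed.
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        write_lines(path, n_lines)
        start = time.perf_counter()
        n = count_lines(path)
        elapsed = time.perf_counter() - start
        return n, elapsed
    finally:
        os.remove(path)
```

Calling run_experiment(5000000) reproduces the 5 million line case; a smaller count is enough to check that the harness works.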

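If you Google "why is python gzip slow" you'll find plenty of discussion of this, including patches with improvements for Python 2.7 and 3.2. In the meantime, use zcat as you did in Perl, which is very fast. Your (first) function takes me about 4.19s with a 5MB compressed file, and the second function takes 0.78s. However, I don't know what is going on with your uncompressed files. If I uncompress the log files (apache logs) and run the two functions on them with a simple Python open(file) and Popen('cat'), Python is faster (0.17s) than cat (0.48s).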

#!/usr/bin/python

import gzip
from subprocess import PIPE, Popen
import sys
import timeit

#pathToLog = 'big.log.gz' # 50M compressed (*10 uncompressed)
pathToLog = 'small.log.gz' # 5M ""

def test_ori():
    counter = 0
    f = gzip.open(pathToLog, 'r')
    for line in f:
        counter = counter + 1
        if (counter % 100000 == 0): # 1000000
            print counter, line
    f.close()

def test_new():
    counter = 0
    content = Popen(["zcat", pathToLog], stdout=PIPE).communicate()[0].split('\n')
    for line in content:
        counter = counter + 1
        if (counter % 100000 == 0): # 1000000
            print counter, line

if '__main__' == __name__:
    to = timeit.Timer('test_ori()', 'from __main__ import test_ori')
    print "Original function time", to.timeit(1)

    tn = timeit.Timer('test_new()', 'from __main__ import test_new')
    print "New function time", tn.timeit(1)
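Note that test_new collects the whole decompressed output with communicate() before splitting it, so the entire file ends up in memory. A streaming variant, closer to what the Perl pipe does, might look like the Python 3 sketch below. It is an illustration under assumptions: the function name is made up, it calls gzip -dc (equivalent to zcat on typical systems) and it expects that command to be on PATH.

```python
import shutil
import subprocess


def count_lines_via_gzip_cli(path, gzip_cmd="gzip"):
    """Count lines in a .gz file by streaming `gzip -dc` output through a pipe."""
    if shutil.which(gzip_cmd) is None:
        raise RuntimeError("%s not found on PATH" % gzip_cmd)

    counter = 0
    # The child process decompresses; Python only iterates over the pipe,
    # one bytes line at a time, without holding the whole file in memory.
    with subprocess.Popen([gzip_cmd, "-dc", path],
                          stdout=subprocess.PIPE) as proc:
        for _ in proc.stdout:
            counter += 1

    # Leaving the with-block waits for the child, so returncode is set here.
    if proc.returncode != 0:
        raise RuntimeError("%s exited with status %d"
                           % (gzip_cmd, proc.returncode))
    return counter
```

Because decompression runs in a separate process, it can overlap with the Python loop, which is the same two-process arrangement the Perl script uses.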

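I spent a while on this. Hopefully this code will do the trick. It uses zlib and makes no external calls.

The gunzipchunks method reads the compressed gzip file in chunks, which can be iterated over (a generator).

The gunziplines method reads these uncompressed chunks and hands you one line at a time, which can also be iterated over (another generator).

Finally, the gunziplinescounter method gives you what you're looking for.

Cheers!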

import zlib

file_name = 'big.txt.gz'
#file_name = 'mini.txt.gz'

#for i in gunzipchunks(file_name): print i
def gunzipchunks(file_name,chunk_size=4096):
    inflator = zlib.decompressobj(16+zlib.MAX_WBITS)
    f = open(file_name,'rb')
    while True:
        packet = f.read(chunk_size)
        if not packet: break
        to_do = inflator.unconsumed_tail + packet
        while to_do:
            decompressed = inflator.decompress(to_do, chunk_size)
            if not decompressed:
                to_do = None
                break
            yield decompressed
            to_do = inflator.unconsumed_tail
    leftovers = inflator.flush()
    if leftovers: yield leftovers
    f.close()

#for i in gunziplines(file_name): print i
def gunziplines(file_name,leftovers="",line_ending='\n'):
    for chunk in gunzipchunks(file_name): 
        chunk = "".join([leftovers,chunk])
        while line_ending in chunk:
            line, leftovers = chunk.split(line_ending,1)
            yield line
            chunk = leftovers
        # keep a chunk that contains no line ending, otherwise it is lost
        leftovers = chunk
    if leftovers: yield leftovers

def gunziplinescounter(file_name):
    for counter,line in enumerate(gunziplines(file_name)):
        if (counter % 1000000 != 0): continue
        print "%12s: %10d" % ("checkpoint", counter)
    print "%12s: %10d" % ("final result", counter)
    print "DEBUG: last line: [%s]" % (line)

gunziplinescounter(file_name)

This should run much faster than the built-in gzip module on very large files.
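For readers on Python 3, the same chunks-then-lines idea can be sketched as follows. This is an illustrative rewrite rather than the answer's original code: the names iter_gzip_chunks and iter_gzip_lines and the 64 KiB default chunk size are assumptions, lines are yielded as bytes rather than str, and only a single-member gzip file is handled.

```python
import zlib


def iter_gzip_chunks(path, chunk_size=64 * 1024):
    """Yield decompressed byte chunks from a single-member gzip file."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    inflator = zlib.decompressobj(16 + zlib.MAX_WBITS)
    with open(path, "rb") as f:
        while True:
            packet = f.read(chunk_size)
            if not packet:
                break
            data = inflator.decompress(packet)
            if data:
                yield data
    tail = inflator.flush()
    if tail:
        yield tail


def iter_gzip_lines(path, chunk_size=64 * 1024, line_ending=b"\n"):
    """Yield lines as bytes, without the trailing line ending."""
    pending = b""
    for chunk in iter_gzip_chunks(path, chunk_size):
        pending += chunk
        # Everything before the last separator is complete; the remainder
        # (possibly empty) is carried over to the next chunk.
        *lines, pending = pending.split(line_ending)
        for line in lines:
            yield line
    if pending:
        yield pending


def count_gzip_lines(path):
    """Count lines by exhausting the line generator."""
    return sum(1 for _ in iter_gzip_lines(path))
```

Carrying the unfinished tail in pending is what keeps a line intact when it straddles two chunks, or when a whole chunk contains no line ending at all.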

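Try using StringIO to buffer the output from the gzip module. The following code, which reads a gzipped pickle, cut the execution time of my code by over 90%.

Instead of...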

import gzip, cPickle

# Use gzip to open/read the pickle.
lPklFile = gzip.open("test.pkl", 'rb')
lData = cPickle.load(lPklFile)
lPklFile.close()

use...

import cStringIO, cPickle, gzip, os

# Use gzip to open the pickle.
lPklFile = gzip.open("test.pkl", 'rb')

# Copy the pickle into a cStringIO.
lInternalFile = cStringIO.StringIO()
lInternalFile.write(lPklFile.read())
lPklFile.close()

# Set the seek position to the start of the StringIO, and read the
# pickled data from it.
lInternalFile.seek(0, os.SEEK_SET)
lData = cPickle.load(lInternalFile)
lInternalFile.close()
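The cStringIO and cPickle modules exist only in Python 2. On Python 3 the same decompress-into-memory-first idea might be sketched with io.BytesIO and pickle, as below. The function name is an assumption for illustration, and the sketch does not show whether the 90% saving reported above carries over to newer interpreters.

```python
import gzip
import io
import pickle


def load_gzipped_pickle(path):
    """Decompress the whole file into memory, then unpickle from the buffer."""
    with gzip.open(path, "rb") as gz_file:
        # One large read() replaces the many small reads that pickle
        # would otherwise issue against the gzip file object.
        buffer = io.BytesIO(gz_file.read())
    return pickle.load(buffer)
```

The trade-off is memory: the whole decompressed pickle is held in the buffer before loading starts.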

The Perl version runs two concurrent processes. Before calling Python slow, you should consider using a two-process solution in Python.

Also, in the Python version you print something inside the loop, but not in the Perl version. Printing is relatively slow.

@Thomas, but once every 1 million records shouldn't be a problem. I'd think the if and the modulo on every record are more of a burden than the print itself.

@S.Lott actually that isn't really relevant, since the "other task" in the Python version is just incrementing a counter. The benefit from parallelism is practically 0%, since that job would be idle 99.99% of the time. The problem is Python's gzip handling.

I have now tried un-gzipping the file before starting, and Python still takes about the same order of magnitude of time. Also, I'd really hope a simple modulo isn't more computationally intensive than a match in Perl?

That's interesting, but isn't it off topic? Whether it's true or not, the OP stated that looping over the uncompressed file (so not using the gzip module at all) is also slow in Python.

@ire_and_curses: thanks for pointing that out, I'll update my answer.

About gzip.open, that's really weird. Is this expected behavior by design?

@Santa: good question... I guess the module wasn't designed to handle large amounts of data (with small files it's fast enough), but I don't know.

@Santa: if you have to ask, then it's never expected behavior.

But how long does reading it in Perl take on your hardware? It's relative performance the OP is asking about, not absolute performance.

We develop on virtual machines and deploy to physical ones. Like @Cory said, I'm not worried about standalone performance; I'm concerned about the difference between Python and Perl here.

@Anonymous: you should be wary of any performance numbers that involve touching the disk inside a virtual machine. Disk IO is the weakest part of VMs. I've seen VMs lock up on fsync for hours or so.

Quick question... what version of Python ships with subprocess? On my box I get:

Traceback (most recent call last):
  File "/fileperformance.py", line 4, in
    from subprocess import PIPE, Popen
ImportError: No module named subprocess

Ah, that is very unhelpful... it ships with 2.4 and RHE
