Optimizing Python code for reading a file


I have the following code. Code 1:

In another part of the script I also have this code. Code 2:

TheTimeStamps = [ x.split(' ')[0][1:-1] for x in open(logfile).readlines() ]

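Clearly, I am loading the log file twice here, and I would like to avoid that. Can I do what I am doing in Code 2 inside Code 1, so that the log file is only loaded once?

In Code 1, I search the log to make sure that two very specific patterns are found, on different lines.

In Code 2, I only extract the first column of every line in the log file.

How can this be optimized further? I am running this script on a log file that is currently 480MB, and the script finishes in about 12 seconds. Since this log can grow to 1GB or even 2GB, I want to make it as efficient as possible.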

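Update:

So @abernert's code works. I went on to add some extra logic to it, and now it no longer works. Below is my modified code. What I am mainly doing here is: if the patterns in matchesBegin and matchesEnd are both found in the log, then search the log from matchesBegin to matchesEnd and print only the lines that contain both stringA and stringB: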

        matchesBegin, matchesEnd = None, None
        beginStr, endStr = str(BeginTimeFirstEpoch).encode(), str(EndinTimeFirstEpoch).encode()
        AllTimeStamps = []
        mylist = []
        with open(logfile, 'rb') as input_data:
            def SearchFirst():
                matchesBegin, matchesEnd = None, None
                for line in input_data:
                    if not matchesBegin:
                        matchesBegin = beginStr in line
                    if not matchesEnd:
                        matchesEnd = endStr in line
                return(matchesBegin, matchesEnd)
            matchesBegin, matchesEndin = SearchFirst()
            #print type(matchesBegin)
            #print type(matchesEndin)
            #if str(matchesBegin) == "True" and str(matchesEnd) == "True":
            if matchesBegin is True and matchesEndin is True:
                rangelines = 0
                for line in input_data:
                    print line
                    if beginStr in line[0:25]:  # Or whatever test is needed
                        rangelines += 1
                        #print line.strip()
                        if re.search(stringA, line) and re.search(stringB, line):
                            mylist.append((line.strip()))
                        break
                for line in input_data:  # This keeps reading the file
                    print line
                    if endStr in line[0:25]:
                        rangelines += 1
                        if re.search(stringA, line) and re.search(stringB, line):
                            mylist.append((line.strip()))
                        break
                    if re.search(stringA, line) and re.search(stringB, line):
                        rangelines += 1
                        mylist.append((line.strip()))
                    else:
                        rangelines += 1
                #return(mylist,rangelines)
                    print(mylist,rangelines)
                    AllTimeStamps.append(line.split(' ')[0][1:-1])

What am I doing wrong in the code above?

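First, there is almost never a good reason to call readlines(). A file is already an iterable of lines, so you can loop over the file itself; reading all of those lines into memory and building a huge list just wastes time and memory.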
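
As a minimal sketch of that point, the list comprehension from Code 2 can iterate over the file object directly. Here io.StringIO stands in for open(logfile), and the bracketed timestamps in the sample lines are an assumption based on the [1:-1] slice in the question; first_columns is a hypothetical helper name.

```python
import io

def first_columns(lines):
    """Collect the first space-separated column of each line, minus its
    first and last character (assumed here to be surrounding brackets)."""
    return [line.split(' ')[0][1:-1] for line in lines]

# io.StringIO stands in for open(logfile); a real file object
# yields its lines one at a time in exactly the same way.
sample = io.StringIO("[1001] start\n[1002] work\n[1003] stop\n")

# No readlines(): the comprehension pulls lines lazily from the file.
stamps = first_columns(sample)
print(stamps)
```

With a real file this would be used as first_columns(f) inside a with open(logfile) as f: block.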

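On the other hand, calling read() can sometimes be useful. It does require reading the whole file into memory as one giant string, but a regex search over one giant string can be so much faster than searching line by line that the wasted time and space are more than compensated for.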
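
A rough sketch of the read() approach, under stated assumptions: the real patterns are not shown in the question, so the patterns, the sample log content, and the helper name contains_both are all hypothetical. Whether this beats a line-by-line search on a 480MB file is something to measure rather than assume.

```python
import os
import re
import tempfile

def contains_both(path, pattern_a, pattern_b):
    """Read the whole file once, then run each regex over the single string."""
    with open(path) as f:
        text = f.read()  # entire file held in memory as one str
    return (re.search(pattern_a, text) is not None
            and re.search(pattern_b, text) is not None)

# Small throwaway file standing in for the real log (assumed format).
with tempfile.NamedTemporaryFile('w', suffix='.log', delete=False) as tmp:
    tmp.write("[1001] start\n[1002] work\n[1003] stop\n")
    sample_path = tmp.name

found = contains_both(sample_path, r"1001", r"1003")
missing = contains_both(sample_path, r"1001", r"9999")
os.remove(sample_path)
print(found, missing)
```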

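But if you want to reduce this to a single pass over the file, then, since you already have to iterate line by line, there is no alternative to doing the regex search line by line as well. This should work (you haven't shown your patterns, but judging by the names I'd guess they don't cross line boundaries and aren't multiline or dotall patterns), but whether it is actually faster or slower will depend on various factors.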

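Anyway, it's certainly worth trying to see whether it helps. (And, while we're at it, I'll use a with statement to make sure you close the file, rather than leaking it as you do in the second part.)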
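
One way such a single pass could look, as a sketch only: Code 1 and its patterns are not shown, so begin_pattern, end_pattern, the helper name scan_log, and the sample lines are assumptions. Each line is tested against both patterns and contributes its first column, all in the same loop.

```python
import io
import re

def scan_log(lines, begin_pattern, end_pattern):
    """One pass: note whether each pattern occurs, and collect first columns."""
    begin_re = re.compile(begin_pattern)
    end_re = re.compile(end_pattern)
    found_begin = found_end = False
    timestamps = []
    for line in lines:
        # Stop testing a pattern once it has been seen.
        if not found_begin and begin_re.search(line):
            found_begin = True
        if not found_end and end_re.search(line):
            found_end = True
        timestamps.append(line.split(' ', 1)[0][1:-1])
    return found_begin, found_end, timestamps

# io.StringIO stands in for the log; with a real file this would be:
#     with open(logfile) as f:
#         result = scan_log(f, ...)
sample = io.StringIO("[1001] start\n[1002] work\n[1003] stop\n")
result = scan_log(sample, r"\b1001\b", r"\b1003\b")
print(result)
```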

There are a few other small changes you could make here that might help.

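I don't know what BeginTimeFirstEpoch is, but your use of str(BeginTimeFirstEpoch) implies that it isn't a regex pattern at all, but something like a datetime object or an int? And you don't really need the match object, you only need to know whether there is a match? If so, you can drop the regex and do a simple substring search, which is a bit faster: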

matchesBegin, matchesEnd = None, None
beginStr, endStr = str(BeginTimeFirstEpoch), str(EndinTimeFirstEpoch)
with …
    # …
    if not matchesBegin:
        matchesBegin = beginStr in line
    if not matchesEnd:
        matchesEnd = endStr in line

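If your search strings, timestamps, and so on are all pure ASCII, it may be faster to process the file in binary mode and decode only the bits you need to store, rather than everything: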

matchesBegin, matchesEnd = None, None
beginStr, endStr = str(BeginTimeFirstEpoch).encode(), str(EndinTimeFirstEpoch).encode()
with open(logFile, 'rb') as f:
    # …
    if not matchesBegin:
        matchesBegin = beginStr in line
    if not matchesEnd:
        matchesEnd = endStr in line
    TheTimeStamps.append(line.split(b' ')[0][1:-1].decode())

Finally, I doubt that str.split is anywhere near the bottleneck in your code, but, just in case... why split on every space when we only need the first split?

TheTimeStamps.append(line.split(b' ', 1)[0][1:-1].decode())
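
To illustrate the difference, here is a small sketch on a made-up log line (the line's format is an assumption). Both calls give the same first column, but passing a maxsplit of 1 stops after the first space instead of cutting the whole line apart.

```python
def first_column(line):
    """First space-separated field of a bytes line, without its first and
    last character (assumed to be brackets), decoded to str."""
    return line.split(b' ', 1)[0][1:-1].decode()

line = b"[1526000000] GET /index.html 200\n"

every_field = line.split(b' ')     # splits at every space
first_only = line.split(b' ', 1)   # splits at the first space only

print(len(every_field), len(first_only))
print(first_column(line))
```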

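I've updated my original post with your suggestions. Please check.

@RoyMWell Your new version iterates over the file three times. I don't understand why you're doing that, since everything can be done in a single loop. And that is the problem: after the first for line in input_data:, you are at the end of the file. So when you do for line in input_data: again, you get all the lines from the end of the file to the end of the file, which is to say, nothing.

I apologize for the confusion. I didn't post the full code because I didn't want to scare off potential users who might help. The end goal here is to know whether the strings in matchesBegin and matchesEnd are found and, if they are, to print the log entries between those two strings. The question is, how do I combine your suggestion with mine to achieve that? Again, I apologize for the confusion.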
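
The exhaustion problem described in the comment, and one possible single-loop arrangement, can be sketched as follows. This is an assumption-laden illustration rather than the answerer's code: the helper name scan_range, the sample lines, and the choice to look for the end marker only after the begin marker has been seen are all hypothetical.

```python
import io
import re

# A file object is a one-shot iterator: a second loop gets nothing.
f = io.StringIO("a\nb\n")
first_pass = [line for line in f]
second_pass = [line for line in f]  # already at end of file

def scan_range(lines, begin, end, pattern_a, pattern_b):
    """Single pass: collect every timestamp, and the lines from the first
    line containing begin through the first later line containing end
    that match both patterns."""
    a_re, b_re = re.compile(pattern_a), re.compile(pattern_b)
    found_begin = found_end = False
    timestamps, matches = [], []
    for line in lines:
        timestamps.append(line.split(' ', 1)[0][1:-1])
        if found_end:
            continue  # range is closed; only timestamps are still needed
        if not found_begin and begin in line:
            found_begin = True
        if found_begin:
            if a_re.search(line) and b_re.search(line):
                matches.append(line.strip())
            if end in line:
                found_end = True
    return found_begin, found_end, timestamps, matches

log = io.StringIO(
    "[100] boot\n"
    "[200] user=alice action=login\n"
    "[300] user=bob action=login\n"
    "[400] user=alice action=logout\n"
    "[500] idle\n"
)
found_begin, found_end, timestamps, matches = scan_range(
    log, "200", "400", r"alice", r"action")
if found_begin and found_end:
    for entry in matches:
        print(entry)
```

Note that matches is only meaningful when both flags come back true, and that it holds the in-range lines in memory, which may matter if the range covers most of a multi-gigabyte log.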