Reading gz files and getting the last 24 hours' lines in Python

I have three files: two .gz files and one .log file. The files are fairly large. Below is a sample of my raw data. I want to extract the entries that fall within the last 24 hours.

a.log.1.gz

2018/03/25-00:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr
2018/03/25-10:08:48.985053 346K     7FE9D2D51706     ahelooa afoaona woom
2018/03/25-20:08:50.486601 1.5M     7FE9D3D41706     qojfcmqcacaeia
2018/03/25-24:08:50.980519  16K     7FE9BD1AF707     user: number is 93823004
2018/03/26-00:08:50.981908 1389     7FE9BDC2B707     user 7fb31ecfa700
2018/03/26-10:08:51.066967    0     7FE9BDC91700     Exit Status = 0x0
2018/03/26-15:08:51.066968    1     7FE9BDC91700     std:ZMD:

a.log.2.gz
2018/03/26-20:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr
2018/03/26-24:08:48.985053 346K     7FE9D2D51706     ahelooa afoaona woom
2018/03/27-00:08:50.486601 1.5M     7FE9D3D41706     qojfcmqcacaeia
2018/03/27-10:08:50.980519  16K     7FE9BD1AF707     user: number is 93823004
2018/03/27-20:08:50.981908 1389     7FE9BDC2B707     user 7fb31ecfa700
2018/03/27-24:08:51.066967    0     7FE9BDC91700     Exit Status = 0x0
2018/03/28-00:08:51.066968    1     7FE9BDC91700     std:ZMD:

a.log
2018/03/28-10:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr
2018/03/28-20:08:48.985053 346K     7FE9D2D51706     ahelooa afoaona woom

**Desired Result**
result.txt
2018/03/27-20:08:50.981908 1389     7FE9BDC2B707     user 7fb31ecfa700
2018/03/27-24:08:51.066967    0     7FE9BDC91700     Exit Status = 0x0
2018/03/28-00:08:51.066968    1     7FE9BDC91700     std:ZMD:
2018/03/28-10:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr
2018/03/28-20:08:48.985053 346K     7FE9D2D51706     ahelooa afoaona woom
I don't know how to get the entries from the last 24 hours.
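The leading timestamp of each entry can be parsed with datetime.strptime and compared against a now-minus-24-hours cutoff. A minimal sketch (the is_recent helper and the fixed now value are my own, not from the question; note that the sample's "24:xx" hours are not valid for %H and would need special handling):

```python
from datetime import datetime, timedelta

def is_recent(line, now=None):
    """Return True if the line's leading timestamp lies within the last 24 hours."""
    now = now or datetime.now()
    stamp = line.split()[0]  # e.g. '2018/03/26-10:08:51.066967'
    when = datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S.%f")
    return when >= now - timedelta(hours=24)

line = "2018/03/26-10:08:51.066967    0     7FE9BDC91700     Exit Status = 0x0"
print(is_recent(line, now=datetime(2018, 3, 27, 9, 0)))   # True: within 24h of this 'now'
print(is_recent(line, now=datetime(2018, 3, 28, 9, 0)))   # False: more than 24h earlier
```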

I want to run the function below on the last 24 hours of data:

def _clean_logs(line):
    # noinspection SpellCheckingInspection
    lemmatizer = WordNetLemmatizer()
    clean_line = line.strip()  # strip the incoming line (was a self-reference before assignment)
    clean_line = clean_line.lstrip('0123456789.- ')
    cleaned_log = " ".join(
        [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in nltk.word_tokenize(clean_line) if
         word not in Stopwords.ENGLISH_STOP_WORDS and 2 < len(word) <= 30 and not word.startswith('_')])
    cleaned_log = cleaned_log.replace('"', ' ')

    return cleaned_log
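Since the rotated files are gzip-compressed, they can be read in text mode with gzip.open and the recent lines selected before any cleaning. A small self-contained sketch (the file contents and cutoff are made up for illustration) that exploits the fixed-width, most-significant-first timestamp format, so a plain string comparison on the prefix orders lines chronologically:

```python
import gzip
import os
import tempfile

# Write a tiny two-line gzip log so the example is self-contained
sample = (
    "2018/03/26-20:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr\n"
    "2018/03/27-20:08:50.981908 1389     7FE9BDC2B707     user 7fb31ecfa700\n"
)
path = os.path.join(tempfile.mkdtemp(), "a.log.2.gz")
with gzip.open(path, mode="wt", encoding="utf-8") as fh:
    fh.write(sample)

# '%Y/%m/%d-%H:%M:%S' is zero-padded, so raw prefix strings sort chronologically
cutoff = "2018/03/27-00:00:00"
with gzip.open(path, mode="rt", encoding="utf-8") as fh:
    recent = [line for line in fh if line.split()[0] >= cutoff]

print(len(recent))  # only the 2018/03/27 line survives the cutoff
```

Each surviving line could then be passed to the _clean_logs function above.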

Something like this should work:

from datetime import datetime, timedelta
import glob
import gzip
from pathlib import Path
import shutil


def open_file(path):
    if Path(path).suffix == '.gz':
        return gzip.open(path, mode='rt', encoding='utf-8')
    else:
        return open(path, encoding='utf-8')


def parsed_entries(lines):
    for line in lines:
        yield line.split(' ', maxsplit=1)


def earlier():
    return (datetime.now() - timedelta(hours=24)).strftime('%Y/%m/%d-%H:%M:%S')


def get_files():
    return ['a.log'] + list(reversed(sorted(glob.glob('a.log.*'))))


output = open('output.log', 'w', encoding='utf-8')
files = get_files()
cutoff = earlier()
for i, path in enumerate(files):
    with open_file(path) as f:
        lines = parsed_entries(f)
        # Assumes that your files are not empty
        date, line = next(lines)
        # The tail of this snippet was garbled in the source; the logic below is
        # an inferred completion: emit entries at or after the cutoff, then
        # bulk-copy the rest of the file with shutil.copyfileobj
        if date >= cutoff:
            # every entry in this file is recent: write the first line back out
            output.write(date + ' ' + line)
            shutil.copyfileobj(f, output)
        else:
            # scan forward to the first recent entry, then bulk-copy the rest
            for date, line in lines:
                if date >= cutoff:
                    output.write(date + ' ' + line)
                    shutil.copyfileobj(f, output)
                    break
If processing the log files regularly involves significant amounts of data, you don't want to read them front-to-back in ascending order every time, since that wastes a lot of resources.

I think the fastest way to achieve your goal is a very simple random-access search: we search the log file in reverse order, starting from the newest entries. You don't have to visit every line; just pick an arbitrary step size and look at only a few lines per step. This way you can search through gigabytes of data in a very short time.

Moreover, this approach does not require keeping every line of the file in memory, only a few lines and the final result.

Given that a.log is the current log file, we start the search here:

with open("a.log", "rb+") as fh:
Since we are only interested in the last 24 hours, we first jump to the end of the file and save the timestamp we are searching for:

timestamp = datetime.datetime.now() - datetime.timedelta(days=1)  # last 24h
# jump to logfile's end
fh.seek(0, 2)  # <-- '2': search relative to file's end
index = fh.tell()  # current position in file; here: logfile's *last* byte
Since contents covers all of the file's remaining content, and therefore all of its lines, we simply split contents at the newline characters \n and can then easily get the desired result using filter.

Every line in lines is fed to check_line, which returns True if the line's time is > timestamp, where timestamp is our datetime object describing exactly now - 1 day. That means check_line returns False for every line older than timestamp, and filter drops those lines.

Obviously this is far from optimal, but it is easy to understand and easily extended to filter on minutes, seconds, and so on.

Covering multiple files is also easy: you just need glob.glob to find all candidate files, start with the newest one, and add another loop: you search through those files until our while loop fails for the first time, then break and read all the remaining content of the current file plus the full content of every file visited before.

Roughly like this:

final_lines = list()

for file in logfiles:
    # our while-loop
    while True:
       ...
    # if while-loop did not break all of the current logfile's content is
    # <24 hours of age
    with open(file, "rb+") as fh:
        final_lines.extend(fh.readlines())

This way, you only end up storing all of a log file's lines when every one of them is less than 24 hours old; extend final_lines with final_result, since final_result contains only the recent lines.

For reference, the backwards search loop and the final read-and-filter step described above:
average_line_length = 65
stepsize = 1000

while True:
    # we move a step back (index is an absolute position from fh.tell(),
    # so seek relative to the start of the file, not its end)
    fh.seek(index - average_line_length * stepsize)

    # save our current position in file
    index = fh.tell()

    # we try to read a "line" (multiply avg. line length times a number
    # large enough to cover even large lines. Ignore largest lines here,
    # since this is an edge case ruining our runtime. We rather skip
    # one iteration of the loop then)
    r = fh.read(average_line_length * 10)

    # our result now contains (on average) multiple lines, so we split
    # first; the first chunk is most likely a partial line, so skip it
    lines = r.split(b"\n")[1:]

    # now we check for our timestring
    found = False
    for l in lines:
        # your timestamps are formatted like '2018/03/28-20:08:48.985053'
        # I ignore minutes, seconds, ... here, just for the sake of simplicity
        timestr = l.split(b":")  # this gives us b'2018/03/28-20' in timestr[0]

        # next we convert this to a datetime (strptime wants str, not bytes)
        found_time = datetime.datetime.strptime(timestr[0].decode(), "%Y/%m/%d-%H")

        # finally, we check whether the found time falls outside our 24-hour margin
        if found_time < timestamp:
            found = True
            break

    # stop stepping backwards once we have passed the 24-hour boundary
    if found:
        break
# read in file's contents from current position to end
contents = fh.read()

# split for lines
lines_of_contents = contents.split(b"\n")

# helper function for removing all lines older than 24 hours
def check_line(line):
    # skip empty chunks (e.g. after a trailing newline)
    if not line:
        return False
    # split to extract datestr
    tstr = line.split(b":")
    # convert this to a datetime (again decoding bytes for strptime)
    ftime = datetime.datetime.strptime(tstr[0].decode(), "%Y/%m/%d-%H")

    return ftime > timestamp

# remove all lines that are older than 24 hours
final_result = filter(check_line, lines_of_contents)