Reading gz files and getting the last 24 hours' lines in Python

I have three files: two .gz files and one .log file. The files are fairly large. Below is a sample of my raw data. I want to extract the entries that fall within the last 24 hours.

a.log.1.gz

2018/03/25-00:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr
2018/03/25-10:08:48.985053 346K     7FE9D2D51706     ahelooa afoaona woom
2018/03/25-20:08:50.486601 1.5M     7FE9D3D41706     qojfcmqcacaeia
2018/03/25-24:08:50.980519  16K     7FE9BD1AF707     user: number is 93823004
2018/03/26-00:08:50.981908 1389     7FE9BDC2B707     user 7fb31ecfa700
2018/03/26-10:08:51.066967    0     7FE9BDC91700     Exit Status = 0x0
2018/03/26-15:08:51.066968    1     7FE9BDC91700     std:ZMD:

a.log.2.gz
2018/03/26-20:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr
2018/03/26-24:08:48.985053 346K     7FE9D2D51706     ahelooa afoaona woom
2018/03/27-00:08:50.486601 1.5M     7FE9D3D41706     qojfcmqcacaeia
2018/03/27-10:08:50.980519  16K     7FE9BD1AF707     user: number is 93823004
2018/03/27-20:08:50.981908 1389     7FE9BDC2B707     user 7fb31ecfa700
2018/03/27-24:08:51.066967    0     7FE9BDC91700     Exit Status = 0x0
2018/03/28-00:08:51.066968    1     7FE9BDC91700     std:ZMD:

a.log
2018/03/28-10:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr
2018/03/28-20:08:48.985053 346K     7FE9D2D51706     ahelooa afoaona woom

**Desired Result**
result.txt
2018/03/27-20:08:50.981908 1389     7FE9BDC2B707     user 7fb31ecfa700
2018/03/27-24:08:51.066967    0     7FE9BDC91700     Exit Status = 0x0
2018/03/28-00:08:51.066968    1     7FE9BDC91700     std:ZMD:
2018/03/28-10:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr
2018/03/28-20:08:48.985053 346K     7FE9D2D51706     ahelooa afoaona woom
I don't know how to get the entries from the last 24 hours.
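The leading timestamp of each entry can be parsed with datetime.strptime and compared against a now-minus-24-hours cutoff. A minimal sketch (the is_recent helper and the fixed now value are my own, not from the question; note that the sample's "24:xx" hours are not valid for %H and would need special handling):

```python
from datetime import datetime, timedelta

def is_recent(line, now=None):
    """Return True if the line's leading timestamp lies within the last 24 hours."""
    now = now or datetime.now()
    stamp = line.split()[0]  # e.g. '2018/03/26-10:08:51.066967'
    when = datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S.%f")
    return when >= now - timedelta(hours=24)

line = "2018/03/26-10:08:51.066967    0     7FE9BDC91700     Exit Status = 0x0"
print(is_recent(line, now=datetime(2018, 3, 27, 9, 0)))   # True: within 24h of this 'now'
print(is_recent(line, now=datetime(2018, 3, 28, 9, 0)))   # False: more than 24h earlier
```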

I want to run the function below on the last 24 hours of data:

def _clean_logs(line):
    # noinspection SpellCheckingInspection
    lemmatizer = WordNetLemmatizer()
    clean_line = line.strip()  # strip the incoming line (was a self-reference before assignment)
    clean_line = clean_line.lstrip('0123456789.- ')
    cleaned_log = " ".join(
        [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in nltk.word_tokenize(clean_line) if
         word not in Stopwords.ENGLISH_STOP_WORDS and 2 < len(word) <= 30 and not word.startswith('_')])
    cleaned_log = cleaned_log.replace('"', ' ')

    return cleaned_log
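Since the rotated files are gzip-compressed, they can be read in text mode with gzip.open and the recent lines selected before any cleaning. A small self-contained sketch (the file contents and cutoff are made up for illustration) that exploits the fixed-width, most-significant-first timestamp format, so a plain string comparison on the prefix orders lines chronologically:

```python
import gzip
import os
import tempfile

# Write a tiny two-line gzip log so the example is self-contained
sample = (
    "2018/03/26-20:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr\n"
    "2018/03/27-20:08:50.981908 1389     7FE9BDC2B707     user 7fb31ecfa700\n"
)
path = os.path.join(tempfile.mkdtemp(), "a.log.2.gz")
with gzip.open(path, mode="wt", encoding="utf-8") as fh:
    fh.write(sample)

# '%Y/%m/%d-%H:%M:%S' is zero-padded, so raw prefix strings sort chronologically
cutoff = "2018/03/27-00:00:00"
with gzip.open(path, mode="rt", encoding="utf-8") as fh:
    recent = [line for line in fh if line.split()[0] >= cutoff]

print(len(recent))  # only the 2018/03/27 line survives the cutoff
```

Each surviving line could then be passed to the _clean_logs function above.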

Something like this should work:

from datetime import datetime, timedelta
import glob
import gzip
from pathlib import Path
import shutil


def open_file(path):
    if Path(path).suffix == '.gz':
        return gzip.open(path, mode='rt', encoding='utf-8')
    else:
        return open(path, encoding='utf-8')


def parsed_entries(lines):
    for line in lines:
        yield line.split(' ', maxsplit=1)


def earlier():
    return (datetime.now() - timedelta(hours=24)).strftime('%Y/%m/%d-%H:%M:%S')


def get_files():
    return ['a.log'] + list(reversed(sorted(glob.glob('a.log.*'))))


output = open('output.log', 'w', encoding='utf-8')
files = get_files()
cutoff = earlier()
for i, path in enumerate(files):
    with open_file(path) as f:
        lines = parsed_entries(f)
        # Assumes that your files are not empty
        date, line = next(lines)
        # The tail of this snippet was garbled in the source; the logic below is
        # an inferred completion: emit entries at or after the cutoff, then
        # bulk-copy the rest of the file with shutil.copyfileobj
        if date >= cutoff:
            # every entry in this file is recent: write the first line back out
            output.write(date + ' ' + line)
            shutil.copyfileobj(f, output)
        else:
            # scan forward to the first recent entry, then bulk-copy the rest
            for date, line in lines:
                if date >= cutoff:
                    output.write(date + ' ' + line)
                    shutil.copyfileobj(f, output)
                    break
If processing the log files regularly involves significant amounts of data, you don't want to read them front-to-back in ascending order every time, since that wastes a lot of resources.

I think the fastest way to achieve your goal is a very simple random-access search: we search the log file in reverse order, starting from the newest entries. You don't have to visit every line; just pick an arbitrary step size and look at only a few lines per step. This way you can search through gigabytes of data in a very short time.

Moreover, this approach does not require keeping every line of the file in memory, only a few lines and the final result.

Given that a.log is the current log file, we start the search here:

with open("a.log", "rb+") as fh:
Since we are only interested in the last 24 hours, we first jump to the end of the file and save the timestamp we are searching for:

timestamp = datetime.datetime.now() - datetime.timedelta(days=1)  # last 24h
# jump to logfile's end
fh.seek(0, 2)  # <-- '2': search relative to file's end
index = fh.tell()  # current position in file; here: logfile's *last* byte
Since contents covers all of the file's remaining content, and therefore all of its lines, we simply split contents at the newline characters \n and can then easily get the desired result using filter.

Every line in lines is fed to check_line, which returns True if the line's time is > timestamp, where timestamp is our datetime object describing exactly now - 1 day. That means check_line returns False for every line older than timestamp, and filter drops those lines.

Obviously this is far from optimal, but it is easy to understand and easily extended to filter on minutes, seconds, and so on.

Covering multiple files is also easy: you just need glob.glob to find all candidate files, start with the newest one, and add another loop: you search through those files until our while loop fails for the first time, then break and read all the remaining content of the current file plus the full content of every file visited before.

Roughly like this:

final_lines = list()

for file in logfiles:
    # our while-loop
    while True:
       ...
    # if while-loop did not break all of the current logfile's content is
    # <24 hours of age
    with open(file, "rb+") as fh:
        final_lines.extend(fh.readlines())

This way, you only end up storing all of a log file's lines when every one of them is less than 24 hours old; extend final_lines with final_result, since final_result contains only the recent lines.

For reference, the backwards search loop and the final read-and-filter step described above:
average_line_length = 65
stepsize = 1000

while True:
    # we move a step back (index is an absolute position from fh.tell(),
    # so seek relative to the start of the file, not its end)
    fh.seek(index - average_line_length * stepsize)

    # save our current position in file
    index = fh.tell()

    # we try to read a "line" (multiply avg. line length times a number
    # large enough to cover even large lines. Ignore largest lines here,
    # since this is an edge case ruining our runtime. We rather skip
    # one iteration of the loop then)
    r = fh.read(average_line_length * 10)

    # our result now contains (on average) multiple lines, so we split
    # first; the first chunk is most likely a partial line, so skip it
    lines = r.split(b"\n")[1:]

    # now we check for our timestring
    found = False
    for l in lines:
        # your timestamps are formatted like '2018/03/28-20:08:48.985053'
        # I ignore minutes, seconds, ... here, just for the sake of simplicity
        timestr = l.split(b":")  # this gives us b'2018/03/28-20' in timestr[0]

        # next we convert this to a datetime (strptime wants str, not bytes)
        found_time = datetime.datetime.strptime(timestr[0].decode(), "%Y/%m/%d-%H")

        # finally, we check whether the found time falls outside our 24-hour margin
        if found_time < timestamp:
            found = True
            break

    # stop stepping backwards once we have passed the 24-hour boundary
    if found:
        break
# read in file's contents from current position to end
contents = fh.read()

# split for lines
lines_of_contents = contents.split(b"\n")

# helper function for removing all lines older than 24 hours
def check_line(line):
    # skip empty chunks (e.g. after a trailing newline)
    if not line:
        return False
    # split to extract datestr
    tstr = line.split(b":")
    # convert this to a datetime (again decoding bytes for strptime)
    ftime = datetime.datetime.strptime(tstr[0].decode(), "%Y/%m/%d-%H")

    return ftime > timestamp

# remove all lines that are older than 24 hours
final_result = filter(check_line, lines_of_contents)