Python 3.x: extract the last 24 hours of logs and clean them

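I have three files: two .gz files and one .log file. The files are fairly large. Below is a sample of my raw data. I want to extract the entries that fall within the last 24 hours.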

a.log.1.gz

2018/03/25-00:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr
2018/03/25-10:08:48.985053 346K     7FE9D2D51706     ahelooa afoaona woom
2018/03/25-20:08:50.486601 1.5M     7FE9D3D41706     qojfcmqcacaeia
2018/03/25-24:08:50.980519  16K     7FE9BD1AF707     user: number is 93823004
2018/03/26-00:08:50.981908 1389     7FE9BDC2B707     user 7fb31ecfa700
2018/03/26-10:08:51.066967    0     7FE9BDC91700     Exit Status = 0x0
2018/03/26-15:08:51.066968    1     7FE9BDC91700     std:ZMD:

a.log.2.gz
2018/03/26-20:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr
2018/03/26-24:08:48.985053 346K     7FE9D2D51706     ahelooa afoaona woom
2018/03/27-00:08:50.486601 1.5M     7FE9D3D41706     qojfcmqcacaeia
2018/03/27-10:08:50.980519  16K     7FE9BD1AF707     user: number is 93823004
2018/03/27-20:08:50.981908 1389     7FE9BDC2B707     user 7fb31ecfa700
2018/03/27-24:08:51.066967    0     7FE9BDC91700     Exit Status = 0x0
2018/03/28-00:08:51.066968    1     7FE9BDC91700     std:ZMD:

a.log
2018/03/28-10:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr
2018/03/28-20:08:48.985053 346K     7FE9D2D51706     ahelooa afoaona woom

I am getting the below result but it is not cleaned.
result.txt
2018/03/27-20:08:50.981908 1389     7FE9BDC2B707     user 7fb31ecfa700
2018/03/27-24:08:51.066967    0     7FE9BDC91700     Exit Status = 0x0
2018/03/28-00:08:51.066968    1     7FE9BDC91700     std:ZMD:
2018/03/28-10:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr
2018/03/28-20:08:48.985053 346K     7FE9D2D51706     ahelooa afoaona woom

The code below pulls the lines from the last 24 hours:

from datetime import datetime, timedelta
import glob
import gzip
from pathlib import Path
import shutil


def open_file(path):
    if Path(path).suffix == '.gz':
        return gzip.open(path, mode='rt', encoding='utf-8')
    else:
        return open(path, encoding='utf-8')


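def parsed_entries(lines):
    for line in lines:
        # Yield the timestamp for comparison together with the full line,
        # so that the line written to the output keeps its timestamp.
        yield line.split(' ', maxsplit=1)[0], line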


def earlier():
    return (datetime.now() - timedelta(hours=24)).strftime('%Y/%m/%d-%H:%M:%S')


def get_files():
    return ['a.log'] + list(reversed(sorted(glob.glob('a.log.*'))))


output = open('output.log', 'w', encoding='utf-8')


files = get_files()


cutoff = earlier()


for i, path in enumerate(files):
    with open_file(path) as f:
        lines = parsed_entries(f)
        # Assumes that your files are not empty
        date, line = next(lines)
        if cutoff <= date:
            # Skip files that can just be appended to the output later
            continue
        for date, line in lines:
            if cutoff <= date:
                # We've reached the first entry of our file that should be
                # included
                output.write(line)
                break
        # Copies from the current position to the end of the file
        shutil.copyfileobj(f, output)
        break
else:
    # In case ALL the files are within the last 24 hours
    i = len(files)

for path in reversed(files[:i]):
    with open_file(path) as f:
        # Assumes that your files have trailing newlines.
        shutil.copyfileobj(f, output)

# Cleanup, it would get closed anyway when garbage collected or process exits.
output.close()
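
A note on the cutoff check in the code above: it compares timestamps as plain strings rather than as datetime objects. That works because every field in the log format is zero-padded and fixed-width, so text order and time order agree. It also sidesteps the entries with an hour field of 24 in the sample data, which datetime.strptime would reject. A minimal sketch of the idea follows; the helper name `cutoff_string` and the fixed "now" are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = '%Y/%m/%d-%H:%M:%S'


def cutoff_string(now, hours=24):
    """Format the moment `hours` before `now` like the log timestamps."""
    return (now - timedelta(hours=hours)).strftime(STAMP_FORMAT)


# A fixed "now" (an assumption for illustration) keeps the example reproducible.
cutoff = cutoff_string(datetime(2018, 3, 28, 21, 0, 0))

stamps = [
    '2018/03/27-20:08:50.981908',
    '2018/03/27-24:08:51.066967',  # hour 24, copied from the sample data
    '2018/03/28-00:08:51.066968',
]

# Zero-padded, fixed-width fields sort the same way as text and as time.
recent = [s for s in stamps if cutoff <= s]
print(cutoff)
print(recent)
```

With this fixed "now", only the first timestamp falls before the cutoff and is dropped.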

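Now I want to use my clean function to clean the dirty data. I don't know how to apply it while pulling the last 24 hours. I want it to be both fast and efficient.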

To answer the "make it memory efficient" part of your question: you can use the re module instead of replace:

import re

for line in lines:
    # Remove every 'T', ':' and '-' character in a single pass
    line = re.sub('[T:-]', '', line)

This reduces the complexity of the code and gives better performance.
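
To make the comparison concrete, here is a small sketch that puts the two approaches side by side. The asker's original clean function is not shown in the thread, so `clean_with_replace` is only an assumption about what it might look like; both function names are made up for this example.

```python
import re

# Compile once so the pattern is reused for every line.
STRIP_CHARS = re.compile('[T:-]')


def clean_with_re(line):
    """Remove 'T', ':' and '-' with a single regular expression."""
    return STRIP_CHARS.sub('', line)


def clean_with_replace(line):
    """Assumed shape of the original approach: one replace call per character."""
    return line.replace('T', '').replace(':', '').replace('-', '')


sample = '2018-03-25T00:08:48 Exit Status = 0x0'
print(clean_with_re(sample))
print(clean_with_re(sample) == clean_with_replace(sample))
```

Both functions return the same text. Which one is faster can depend on the input and the Python version, so it is worth measuring with timeit on real log lines before relying on either.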

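OK. When I am processing the last 24 hours of data, where can I call the clean function?

I don't understand your question. Are you asking where the function you created should be called? In that case, you would want to wrap all the functions under a def main() function at the end and call them from there.

If you look at my large block of code, I am using shutil.copyfileobj(), which copies the whole file to the output when the date is less than the current date. My clean function works line by line, so I can't use it with shutil.copyfileobj(). I want the clean function to be called while my large block of code is extracting the lines from the last 24 hours.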
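
One possible way to combine the two steps is to replace the bulk shutil.copyfileobj() copy with a line-by-line loop, so a line-wise clean function can run on each extracted line. The sketch below is not the answerer's code: `clean` is a hypothetical stand-in for the asker's function, `recent_lines` and `extract_and_clean` are names invented here, and it assumes the paths are passed oldest first and that entries are in chronological order.

```python
import gzip
import re
from itertools import dropwhile
from pathlib import Path

STRIP_CHARS = re.compile('[T:-]')


def clean(line):
    """Hypothetical stand-in for the asker's line-wise clean function."""
    return STRIP_CHARS.sub('', line)


def open_file(path):
    """Open plain or gzip-compressed logs as text."""
    if Path(path).suffix == '.gz':
        return gzip.open(path, mode='rt', encoding='utf-8')
    return open(path, encoding='utf-8')


def all_lines(paths):
    """Yield every line of every file, one at a time."""
    for path in paths:
        with open_file(path) as f:
            yield from f


def recent_lines(paths, cutoff):
    """Skip lines until the first timestamp at or after `cutoff`.

    Assumes `paths` is ordered oldest first and entries are chronological,
    so every line after the first match is inside the window.
    """
    def is_old(line):
        return line.split(' ', maxsplit=1)[0] < cutoff

    return dropwhile(is_old, all_lines(paths))


def extract_and_clean(paths, cutoff, out_path):
    """Write the cleaned lines of the last window to `out_path`."""
    with open(out_path, 'w', encoding='utf-8') as out:
        out.writelines(clean(line) for line in recent_lines(paths, cutoff))
```

Only one line is held in memory at a time, and the timestamp test stops running after the first line inside the window. The trade-off is that older files are still decompressed and scanned. To avoid that, the file-skipping step from the question's code could be kept and only its copyfileobj() calls replaced with a loop that calls clean on each line.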
