Python 3.x 提取最后24小时的日志并将其清理到Python3.x
我有三个文件:2.gz文件和1.log文件。这些文件相当大。下面是我原始数据的样本副本。我想提取与过去24小时相对应的条目Python 3.x 提取最后24小时的日志并将其清理到Python3.x,python-3.x,nltk,gzip,Python 3.x,Nltk,Gzip,我有三个文件:2.gz文件和1.log文件。这些文件相当大。下面是我原始数据的样本副本。我想提取与过去24小时相对应的条目 a.log.1.gz 2018/03/25-00:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr 2018/03/25-10:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom 2018/03/25-20:08:50.486601
a.log.1.gz
2018/03/25-00:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/25-10:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
2018/03/25-20:08:50.486601 1.5M 7FE9D3D41706 qojfcmqcacaeia
2018/03/25-24:08:50.980519 16K 7FE9BD1AF707 user: number is 93823004
2018/03/26-00:08:50.981908 1389 7FE9BDC2B707 user 7fb31ecfa700
2018/03/26-10:08:51.066967 0 7FE9BDC91700 Exit Status = 0x0
2018/03/26-15:08:51.066968 1 7FE9BDC91700 std:ZMD:
a.log.2.gz
2018/03/26-20:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/26-24:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
2018/03/27-00:08:50.486601 1.5M 7FE9D3D41706 qojfcmqcacaeia
2018/03/27-10:08:50.980519 16K 7FE9BD1AF707 user: number is 93823004
2018/03/27-20:08:50.981908 1389 7FE9BDC2B707 user 7fb31ecfa700
2018/03/27-24:08:51.066967 0 7FE9BDC91700 Exit Status = 0x0
2018/03/28-00:08:51.066968 1 7FE9BDC91700 std:ZMD:
a.log
2018/03/28-10:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/28-20:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
I am getting the below result but it is not cleaned.
result.txt
2018/03/27-20:08:50.981908 1389 7FE9BDC2B707 user 7fb31ecfa700
2018/03/27-24:08:51.066967 0 7FE9BDC91700 Exit Status = 0x0
2018/03/28-00:08:51.066968 1 7FE9BDC91700 std:ZMD:
2018/03/28-10:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/28-20:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
下面的代码拉最后24小时的线
from datetime import datetime, timedelta
import glob
import gzip
from pathlib import Path
import shutil
def open_file(path):
if Path(path).suffix == '.gz':
return gzip.open(path, mode='rt', encoding='utf-8')
else:
return open(path, encoding='utf-8')
def parsed_entries(lines):
for line in lines:
yield line.split(' ', maxsplit=1)
def earlier():
return (datetime.now() - timedelta(hours=24)).strftime('%Y/%m/%d-%H:%M:%S')
def get_files():
return ['a.log'] + list(reversed(sorted(glob.glob('a.log.*'))))
output = open('output.log', 'w', encoding='utf-8')
files = get_files()
cutoff = earlier()
for i, path in enumerate(files):
with open_file(path) as f:
lines = parsed_entries(f)
# Assumes that your files are not empty
date, line = next(lines)
if cutoff <= date:
# Skip files that can just be appended to the output later
continue
for date, line in lines:
if cutoff <= date:
# We've reached the first entry of our file that should be
# included
output.write(line)
break
# Copies from the current position to the end of the file
shutil.copyfileobj(f, output)
break
else:
# In case ALL the files are within the last 24 hours
i = len(files)
for path in reversed(files[:i]):
with open_file(path) as f:
# Assumes that your files have trailing newlines.
shutil.copyfileobj(f, output)
# Cleanup, it would get closed anyway when garbage collected or process exits.
output.close()
现在,我想使用able clean函数来清理脏数据。我不知道如何使用它,而拉过去24小时。我想让它既快又有效率。要回答您问题中的“提高内存效率”部分,您可以使用re模块代替替换模块:
for line in lines:
line = re.sub('[T:-]', '', line)
这将降低代码的复杂性并提供更好的性能Ok。当我处理最后24小时的数据时,在哪里可以调用clean函数?我不理解你的问题。你是说你创建的函数应该在哪里调用?在这种情况下,您希望在末尾将所有函数包装在def main()函数下,并从那里调用它们。如果您看到我的一段大代码,我使用的是shutil.copyobj(),这将在日期小于当前日期时复制整个文件以输出。我的清洁功能可以在线工作。所以,我不能在shutil.copyobj()上使用它。我希望在我的大代码最后24小时提取行时调用clean函数。
for line in lines:
line = re.sub('[T:-]', '', line)