Read gz files and get the lines from the last 24 hours in Python
I have three files: two .gz files and one .log file. These files are quite large. Below is a sample of my original data. I want to extract the entries that correspond to the last 24 hours.
a.log.1.gz
2018/03/25-00:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/25-10:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
2018/03/25-20:08:50.486601 1.5M 7FE9D3D41706 qojfcmqcacaeia
2018/03/25-24:08:50.980519 16K 7FE9BD1AF707 user: number is 93823004
2018/03/26-00:08:50.981908 1389 7FE9BDC2B707 user 7fb31ecfa700
2018/03/26-10:08:51.066967 0 7FE9BDC91700 Exit Status = 0x0
2018/03/26-15:08:51.066968 1 7FE9BDC91700 std:ZMD:
a.log.2.gz
2018/03/26-20:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/26-24:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
2018/03/27-00:08:50.486601 1.5M 7FE9D3D41706 qojfcmqcacaeia
2018/03/27-10:08:50.980519 16K 7FE9BD1AF707 user: number is 93823004
2018/03/27-20:08:50.981908 1389 7FE9BDC2B707 user 7fb31ecfa700
2018/03/27-24:08:51.066967 0 7FE9BDC91700 Exit Status = 0x0
2018/03/28-00:08:51.066968 1 7FE9BDC91700 std:ZMD:
a.log
2018/03/28-10:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/28-20:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
** Desired Result**
result.txt
2018/03/27-20:08:50.981908 1389 7FE9BDC2B707 user 7fb31ecfa700
2018/03/27-24:08:51.066967 0 7FE9BDC91700 Exit Status = 0x0
2018/03/28-00:08:51.066968 1 7FE9BDC91700 std:ZMD:
2018/03/28-10:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/28-20:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
I don't know how to get the entries from the last 24 hours.
I want to run the function below on the data from the last 24 hours:
import nltk
from nltk.stem import WordNetLemmatizer

# get_wordnet_pos and Stopwords are not shown in the question
def _clean_logs(line):
    # noinspection SpellCheckingInspection
    lemmatizer = WordNetLemmatizer()
    # start from the raw input line
    clean_line = line.strip()
    # drop the leading timestamp characters
    clean_line = clean_line.lstrip('0123456789.- ')
    cleaned_log = " ".join(
        [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in nltk.word_tokenize(clean_line) if
         word not in Stopwords.ENGLISH_STOP_WORDS and 2 < len(word) <= 30 and not word.startswith('_')])
    cleaned_log = cleaned_log.replace('"', ' ')
    return cleaned_log
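Something like this should work:
from datetime import datetime, timedelta
import glob
import gzip
from pathlib import Path
import shutil

def open_file(path):
    if Path(path).suffix == '.gz':
        return gzip.open(path, mode='rt', encoding='utf-8')
    else:
        return open(path, encoding='utf-8')

def parsed_entries(lines):
    for line in lines:
        yield line.split(' ', maxsplit=1)

def earlier():
    return (datetime.now() - timedelta(hours=24)).strftime('%Y/%m/%d-%H:%M:%S')

def get_files():
    return ['a.log'] + list(reversed(sorted(glob.glob('a.log.*'))))

output = open('output.log', 'w', encoding='utf-8')
files = get_files()
cutoff = earlier()

for i, path in enumerate(files):
    with open_file(path) as f:
        lines = parsed_entries(f)
        # Assumes that your files are not empty
        date, line = next(lines)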
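Here is a minimal sketch of the same idea in complete form. It assumes that the entries in every file are in chronological order and that each line starts with a zero-padded timestamp such as 2018/03/28-20:08:48.985053, so that plain string comparison orders entries by time. The names open_text and copy_recent are made up for this illustration. The caller passes the files newest first, because how rotated files are numbered differs between setups.

```python
import gzip
import shutil
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed log file in text mode."""
    if Path(path).suffix == ".gz":
        return gzip.open(path, mode="rt", encoding="utf-8")
    return open(path, encoding="utf-8")


def copy_recent(paths_newest_first, cutoff, out):
    """Write every entry whose timestamp is >= cutoff to out, oldest first.

    cutoff is a string such as '2018/03/27-20:08:50'.
    """
    selected = []  # pairs of (path, needs_scan), newest file first
    for path in paths_newest_first:
        with open_text(path) as f:
            first = f.readline()
        if not first:
            continue  # empty file: nothing to copy, keep looking at older files
        if first.split(" ", 1)[0] >= cutoff:
            selected.append((path, False))  # the whole file is recent
            continue
        selected.append((path, True))  # the cutoff falls inside (or after) this file
        break  # every older file is entirely before the cutoff

    for path, needs_scan in reversed(selected):
        with open_text(path) as f:
            if needs_scan:
                # read one line at a time until the first recent entry
                for line in f:
                    if line.split(" ", 1)[0] >= cutoff:
                        out.write(line)
                        break
            # copy everything from the current position to the end of the file
            shutil.copyfileobj(f, out)
```

With the files from the question this could be called as copy_recent(['a.log', 'a.log.2.gz', 'a.log.1.gz'], cutoff, out), where cutoff is (datetime.now() - timedelta(hours=24)).strftime('%Y/%m/%d-%H:%M:%S') and out is an open text file. Only the one file that contains the cutoff is read line by line; the newer files are copied in bulk.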
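Processing logfiles usually involves quite large amounts of data, so you do not want to read them in ascending order every time, since that wastes a lot of resources.

I think the fastest way to reach your goal is a very simple random search: we search through the logfiles in reverse order, starting with the newest. Instead of visiting every line, you pick some arbitrary step size and only look at a few lines at each step. This way you can search through gigabytes of data in a very short time.

Additionally, this approach does not need to hold every line of a file in memory, only a few lines and the final result.

Since a.log is the current logfile, we start searching here: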
with open("a.log", "rb+") as fh:
Since we are only interested in the last 24 hours, we first jump to the end of the file and store the timestamp we are searching for:
import datetime  # needed for the timestamp arithmetic below
timestamp = datetime.datetime.now() - datetime.timedelta(days=1) # last 24h
# jump to logfile's end
fh.seek(0, 2) # <-- '2': search relative to file's end
index = fh.tell() # current position in file; here: logfile's *last* byte
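Since contents now covers all of the remaining content of the file (and lines_of_contents all of its lines, which is simply contents split at the line breaks \n), we can easily use filter to get the desired result.
Each line in lines_of_contents is fed to check_line, which returns True if the line's time is > timestamp, where timestamp is our datetime object describing exactly now - 1 day. This means that check_line returns False for all lines older than timestamp, and filter removes those lines.
Obviously, this is far from optimal, but it is easy to understand and easy to extend to filtering by minutes, seconds, ...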
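The single-file search described above can be sketched end to end roughly as follows. This is only an illustration under a few assumptions: the entries are in chronological order, every entry starts with a timestamp in the format from the question, and the file is opened in binary mode. The names parse_timestamp and recent_lines are invented for this sketch, and unlike the snippets in this answer it parses the full timestamp instead of only the hour.

```python
import datetime
import io
import os

STAMP_FORMAT = "%Y/%m/%d-%H:%M:%S.%f"


def parse_timestamp(line):
    """Return the datetime at the start of a raw log line, or None if it cannot be parsed."""
    token = line.split(b" ", 1)[0]
    try:
        return datetime.datetime.strptime(token.decode("ascii"), STAMP_FORMAT)
    except ValueError:  # also covers undecodable bytes and hour values such as 24
        return None


def recent_lines(fh, cutoff, step=64 * 1024):
    """Return the lines of the binary file fh, starting at the first entry at or after cutoff."""
    fh.seek(0, os.SEEK_END)
    pos = fh.tell()
    while pos > 0:
        # step backwards, but never before the start of the file
        pos = max(pos - step, 0)
        fh.seek(pos)
        if pos:
            fh.readline()  # we most likely landed mid-line, so skip to the next line start
        stamp = parse_timestamp(fh.readline())
        if stamp is not None and stamp < cutoff:
            break  # this entry is too old, so the boundary lies somewhere after pos
    # scan forward from the last probe position
    fh.seek(pos)
    if pos:
        fh.readline()
    result, keeping = [], False
    for line in fh:
        if not keeping:
            stamp = parse_timestamp(line)
            keeping = stamp is not None and stamp >= cutoff
        if keeping:
            result.append(line)
    return result


sample = io.BytesIO(
    b"2018/03/27-10:08:50.980519 16K 7FE9BD1AF707 user: number is 93823004\n"
    b"2018/03/27-20:08:50.981908 1389 7FE9BDC2B707 user 7fb31ecfa700\n"
    b"2018/03/28-10:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr\n"
)
for line in recent_lines(sample, datetime.datetime(2018, 3, 27, 20, 0, 0)):
    print(line.decode("ascii"), end="")
```

Once the first recent entry has been found, every later line is kept without parsing it again, so entries that strptime cannot parse (for example the ones with hour 24 in the sample data) are still included as long as they come after it.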
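Covering multiple files is easy as well: you just need glob.glob to find all candidate files, start with the newest file and add another loop. You search through the files until our while-loop fails for the first time, then break and read all remaining content of the current file plus all content of the files visited before.
Roughly like this: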
final_lines = list()
for file in logfiles:
    # our while-loop
    while True:
        ...
    # if while-loop did not break all of the current logfile's content is
    # <24 hours of age
    with open(file, "rb+") as fh:
        final_lines.extend(fh.readlines())
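This way you simply store all lines of a logfile if all of its lines are less than 24 hours old. If the loop breaks at some point, i.e. we have found the logfile and the exact line that are older than 24 hours, extend final_lines by final_result, since this will only cover the lines that are less than 24 hours old.

So, what have you tried so far?
@ChrisHunt, I am trying to read and append all the files. But I end up with one huge file. That solution won't work, because in my real scenario I have 40 large files. If I append all 40 files, processing becomes very slow.
@ChrisHunt, I want to use the function I added to the question to clean every line of the text files that falls within the last 24 hours.
You should ask a new question with specific information!
Hey, this is great. Thank you very much for this solution. But I have already written some code that combines all the logs, and now I want to process that to get the content of the last 24 hours. Can you help me with this?
I am getting the error TypeError: strptime() argument 1 must be str, not bytes when I implement this line: found_time = datetime.datetime.strptime(timestr[0], "%Y/%m/%d-%H")
def parser_logs(file): counter = 0 for line in reversed(list(file)): line = line.split(' ', maxsplit=1) yield line[0], line[1] counter = counter + 1. I have written parser_logs to parse the entries. Could you please look into it? When I run this function I only get one value for date_part and line.
@user15051990 That is very close, but you need to make sure not to do list(file), since that consumes all the lines. Our solution relies on consuming only one line at a time, because we use shutil.copyfileobj to copy the rest of the file. I have added an implementation of parsed_entries to my answer so you can compare.
Thanks Chris. One more quick question: in the output file I only get the text of the last file. It is not copying from the current position to the end of the file. Any comments?
Other than that, it is not needed in my case. To debug, I would do this: add a log statement in the first for loop to print each file name, add one inside the if cutoff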
average_line_length = 65
stepsize = 1000
found = False
while not found and index > 0:
    # we move a step back (absolute position, never before the file's start)
    index = max(index - average_line_length * stepsize, 0)
    fh.seek(index)
    # we try to read a "line" (multiply avg. line length times a number
    # large enough to cover even large lines. Ignore largest lines here,
    # since this is an edge case ruining our runtime. We rather skip
    # one iteration of the loop then)
    r = fh.read(average_line_length * 10)
    # our result now contains (on average) multiple lines, so we
    # split first
    lines = r.split(b"\n")
    # now we check for our timestring
    for l in lines:
        # your timestamps are formatted like '2018/03/28-20:08:48.985053'
        # I ignore minutes, seconds, ... here, just for the sake of simplicity
        timestr = l.split(b":") # this gives us b'2018/03/28-20' in timestr[0]
        try:
            # next we convert this to a datetime (strptime needs str, not bytes)
            found_time = datetime.datetime.strptime(timestr[0].decode("ascii"), "%Y/%m/%d-%H")
        except ValueError:
            # partial or malformed line, e.g. the chunk started mid-line
            continue
        # finally, we compare if the found time is not inside our 24hour margin
        if found_time < timestamp:
            found = True  # also ends the outer while-loop
            break
# go back to the start of the last chunk so that the next read begins there
fh.seek(index)
# read in file's contents from current position to end
contents = fh.read()
# split for lines
lines_of_contents = contents.split(b"\n")
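# helper function for removing all lines older than 24 hours
def check_line(line):
    # split to extract datestr
    tstr = line.split(b":")
    try:
        # convert this to a datetime (strptime needs str, so decode the bytes)
        ftime = datetime.datetime.strptime(tstr[0].decode("ascii"), "%Y/%m/%d-%H")
    except ValueError:
        # partial, empty or malformed line: drop it (this also drops
        # entries whose hour field is 24, which %H does not accept)
        return False
    return ftime > timestamp
# remove all lines that are older than 24 hours
final_result = list(filter(check_line, lines_of_contents))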