Read data from a text file that was written within a specific time period, using Python

python python-3.x python-2.7 file

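To explain in detail: I have a text file that records data from a varying number of process instances (anywhere from 4 to 16 instances may be generating logs).
All instances write to a single file in the following format: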

2018-09-07 11:34:47,251 - AppLog - INFO - 
    ******************************************
    Log Report - Consume Cycle jhTyjs-ConsumeCycle
    ******************************************
    Uptime: 144708.62724542618s
    Jobs Run: 16866
    Jobs Current: 1
    Q Avg Read Time: 0
    Q Msgs Read: 0
    Worker Load: ['1.00', '1.00', '1.00']
    ******************************************

2018-09-07 11:37:47,439 - AppLog - INFO - 
    ******************************************
    Log Report - Consume Cycle aftTys-ConsumeCycle
    ******************************************
    Uptime: 144888.81490063667s
    Jobs Run: 16866
    Jobs Current: 1
    Q Avg Read Time: 0
    Q Msgs Read: 0
    Worker Load: ['1.00', '1.00', '1.00']
    ******************************************

  This is an error line which could be generated by any of the instances and can be anything, <br> like qfuigeececevwovw or wefebew efeofweffhw v wihv or any python \n exception or aiosfgd ceqic eceewfi 

2018-09-07 11:40:47,615 - AppLog - INFO - 
    ******************************************
    Log Report - Consume Cycle hdyGid-ConsumeCycle
    ******************************************
    Uptime: 145068.99103808403s
    Jobs Run: 16866
    Jobs Current: 1
    Q Avg Read Time: 0
    Q Msgs Read: 0
    Worker Load: ['1.00', '1.00', '1.00']
    ******************************************


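(In each log, the [placeholder] in

Log Report - Consume Cycle [placeholder]-ConsumeCycle

is random.)
So my file consists of a large number of logs in the above format, one after another. Each instance generates a log every 3 minutes (i.e. each instance produces exactly one log within any 3-minute window).
If any instance hits an error, it logs that to the same file as well, so the data structure is not consistent at all.

Now I have to get the last logged data from all instances, i.e. the data from the last 3 minutes, and perform some tasks on it.
Is there a way to get the data written to the log file in the last 3 minutes (whether it is an error line or a well-formed log in the format above)?

[Edit] Added an error line between the logs.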

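You can split the file's contents at

******************************************\n\n

with a string split, storing the result in record_list. This gives you each independent record in a list. You may need to escape the backslashes. You can get the last element of the list by indexing with -1:

print(record_list[-1])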
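
A minimal sketch of this splitting approach, assuming the whole file fits in memory and that the closing row of asterisks is immediately followed by an empty line. It swaps the literal separator string for re.split with a pattern that matches any run of asterisks followed by a blank line, so the exact number of asterisks does not matter. The names split_records, RECORD_END and sample_text are hypothetical, and the sample records are shortened versions of the ones in the question.

```python
import re

# A run of asterisks at the end of a line, followed by an empty line.
# Only the closing row of each report matches: the two rows around the
# "Log Report" title are followed directly by more text.
RECORD_END = re.compile(r"\*+\n\n")


def split_records(text):
    """Split the log text into records, dropping empty chunks."""
    return [chunk for chunk in RECORD_END.split(text) if chunk.strip()]


# Two shortened records in the question's layout, used as sample input.
STARS = "    " + "*" * 42
sample_text = "\n".join([
    "2018-09-07 11:34:47,251 - AppLog - INFO - ",
    STARS,
    "    Log Report - Consume Cycle jhTyjs-ConsumeCycle",
    STARS,
    "    Jobs Run: 16866",
    STARS,
    "",
    "2018-09-07 11:37:47,439 - AppLog - INFO - ",
    STARS,
    "    Log Report - Consume Cycle aftTys-ConsumeCycle",
    STARS,
    "    Jobs Run: 16866",
    STARS,
    "",
    "",
])

record_list = split_records(sample_text)
print(record_list[-1])  # the last record in the sample
```

Note that with this approach a stray error line written between two reports ends up at the start of the record that follows it, so the records are not guaranteed to begin with a timestamp.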

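Since you've said the file won't grow too large to handle, you don't need anything fancy to sift through it (i.e. buffered reading from the back). You can simply iterate over the whole file, collect the individual log entries and discard those that occurred more than 3 minutes ago.

This is especially easy because your entries are clearly distinguished from one another by the datetime at the start, and the log date format is an ISO-8601 format, so you don't even need to parse the date: you can use a straight lexicographical comparison.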
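
The point about lexicographical comparison can be checked with a short sketch. Because every field of the timestamp is zero-padded and the fields run from most to least significant, comparing the first 19 characters as plain strings orders them the same way as comparing parsed datetime objects. The two timestamps are taken from the sample log in the question; the variable names are only illustrative.

```python
import datetime

FORMAT = "%Y-%m-%d %H:%M:%S"

# The first 19 characters of a header line hold the date and time
# without the milliseconds.
earlier = "2018-09-07 11:34:47,251 - AppLog - INFO - "[:19]
later = "2018-09-07 11:40:47,615 - AppLog - INFO - "[:19]

# Plain string comparison...
string_order = earlier < later

# ...agrees with comparing the parsed datetime objects.
parsed_order = (datetime.datetime.strptime(earlier, FORMAT)
                < datetime.datetime.strptime(later, FORMAT))

print(string_order, parsed_order)  # True True
```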

So, one way to do it is:

import datetime

# if your datetime is in UTC use datetime.datetime.utcnow() instead
threshold = datetime.datetime.now() - datetime.timedelta(minutes=3)  # 3m ago
# turn it into an ISO-8601 string in the same layout as the log's timestamps
threshold_cmp = threshold.strftime("%Y-%m-%d %H:%M:%S")  # we'll ignore the milliseconds

entries = []
with open("path/to/your.log") as f:  # open your log for reading
    current_date = ""
    current_entry = ""
    for line in f:  # iterate over it line-by-line
        if line[0].isdigit():  # beginning of a (new) log entry
            # store the previous entry if newer than 3 minutes
            if current_date >= threshold_cmp:
                entries.append(current_entry)
            current_date = line[:19]  # store the date of this (new) entry
            current_entry = ""  # (re)initialize the entry
        current_entry += line  # add the current line to the cached entry
    if current_entry and current_date >= threshold_cmp:  # store the leftovers, if any
        entries.append(current_entry)

# now the list 'entries' contains individual entries that occurred in the past 3 minutes
print("".join(entries))  # print them out, or do whatever you want with them

You could make this easier by keying on the placeholder, but you've said that the placeholder is random, so you have to rely on the datetime.

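So, technically, you want the last log in the file for each instance (assuming that what follows "Consume Cycle" is a unique instance ID)?

You are correct. The last 3 minutes of data will provide the last log of every instance. It can happen that only one proper log is generated while all the other instances produce some random Python errors.

When you say the last three minutes, do you want a list of entries, where each entry is a multi-line text?

@Hogstrom Yes, that's exactly what I want.

@zwer Not really. The log file is deleted after a while and a new one is created, so the file size is bounded and it can be loaded into working memory.

But how will this decide whether I have only the last log from each instance? Splitting gives me all the logs; I only need the last log from each process instance (there can be 4 to 16 of them).

@AmitYadav: Just parse the time and discard all records that are too old.

Great answer. But as I mentioned, the log file may also contain error lines from any instance. I even edited my question and added a random error line.

@AmitYadav - Do you identify the error lines (which, presumably, you don't want to collect) by their being outside a log entry (i.e. more than 10 lines after the date)?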
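
The two ideas raised in these comments, keeping only the last report per instance and treating anything outside a report as an error line, could be combined roughly as follows. This is a sketch under stated assumptions, not a tested solution for the real file: it assumes that a well-formed report is a timestamp line followed by exactly 10 lines (as in the sample), that the text between "Consume Cycle" and "-ConsumeCycle" identifies the instance, and that reports from different instances are never interleaved line by line. The names parse_log, TIMESTAMP, INSTANCE and BODY_LINES are hypothetical.

```python
import re

# Assumed header layout: "2018-09-07 11:34:47,251 - AppLog - INFO - "
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} - ")
# Assumption: the text before "-ConsumeCycle" identifies the instance.
INSTANCE = re.compile(r"Consume Cycle (\S+)-ConsumeCycle")
BODY_LINES = 10  # lines after the timestamp line in a well-formed report


def parse_log(lines, threshold_cmp):
    """Return (latest report per instance ID, stray lines).

    Only reports whose timestamp is >= threshold_cmp are kept. Stray
    lines carry no timestamp of their own, so they are kept when the
    report written just before them is recent enough.
    """
    latest = {}
    stray = []
    current = None     # lines of the report being collected, if any
    current_date = ""  # timestamp of the most recent report header
    for line in lines:
        if TIMESTAMP.match(line):
            current = [line]          # a new report starts here
            current_date = line[:19]  # date and time without milliseconds
        elif current is not None:
            current.append(line)
            if len(current) == BODY_LINES + 1:  # report is complete
                text = "".join(current)
                found = INSTANCE.search(text)
                if found and current_date >= threshold_cmp:
                    # a later report overwrites an earlier one
                    latest[found.group(1)] = text
                current = None
        elif line.strip() and current_date >= threshold_cmp:
            stray.append(line)  # non-blank line outside any report
    return latest, stray
```

Called with an open file object and the threshold_cmp string from the answer above, parse_log would return a dictionary mapping each instance ID to its most recent report, plus a list of the error lines that followed a recent report.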
