Python: using a "binary search" to find a "suspicious" time in a log file

python search logging

Is there a way to do an efficient binary search for a "suspicious time" in a log file using Python?

I have a log file whose entries look like this:

02:38:18  0  RcvTxData - 11 : Telegram received and process completed - MCP35 Tx -24239
02:38:20  0  RcvNewTxNo - 3 : MCP36 Set receive trigger
02:38:21  0  RcvNewTxNo - 1 : 
02:38:21  0  RcvNewTxNo - 1 : MCP35 get new Tx 24241
02:38:23  0  RcvTxData - 11 : Telegram received and process completed - MCP36 Tx -13918
02:38:23  0  RcvNewTxNo - 3 : MCP36 Set receive trigger
02:38:24  0  RcvNewTxNo - 1 : 
02:38:24  0  RcvTxData - 11 : Telegram received and process completed - MCP35 Tx -24241
02:38:24  0  RcvNewTxNo - 3 : MCP35 Set receive trigger
02:38:27  0  RcvNewTxNo - 1 : 
02:38:27  0  RcvNewTxNo - 1 : MCP36 get new Tx 13920
09:44:54  0  RcvNewTxNo - 1 : 
09:44:54  0  RcvNewTxNo - 1 : MCP24 get new Tx 17702
09:44:54  0  RcvNewTxNo - 2 : MCP24 Read last Tx before new Tx 17702
09:44:56  0  RcvNewTxNo - 1 : 
09:45:00  0  RcvTxData - 7 :MCP24 Prepare normal TxData to DB
09:45:01  0  RcvTxData - 8 :MCP24 complete call GetTxData
09:45:02  0  RcvTxData - 11 : Telegram received and process completed - MCP10 Tx -9008
09:45:02  0  RcvNewTxNo - 3 : MCP10 Set receive trigger
09:45:04  0  RcvNewTxNo - 1 : 
09:45:04  0  RcvNewTxNo - 3 : MCP24 Set receive trigger
09:45:16  0  RcvNewTxNo - 1 : 
09:45:16  0  RcvNewTxNo - 1 : MCP19 get new Tx 9133
09:45:16  0  RcvNewTxNo - 2 : MCP19 Read last Tx before new Tx 9133
09:45:17  0  RcvTxData - 1 :MCP19 gwTx-9133 lastTx-9131 newTx-0
09:45:17  0  RcvTxData - 4 :MCP19 Adjusted newTxNo_Val-9132
09:45:17  0  RcvTxData - 4.1 :MCP19 FnCode PF
09:45:23  0  RcvTxData - 1 :MCP24 gwTx-17706 lastTx-17704 newTx-0

As the sample above shows, the log times are non-decreasing, but the time can suddenly jump:

02:38:27  0  RcvNewTxNo - 1 : MCP36 get new Tx 13920
09:44:54  0  RcvNewTxNo - 1 : #there is a big jump here

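My goal is to detect this suspicious line and return the line together with its index.

I wrote a function to detect this "suspicious time". However, the log file is large (judging by the comment in the loop below, roughly 22,000 to 44,000 lines), so my algorithm is very slow because it goes from one line to the next: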

import datetime

fmt = "%H:%M:%S"  # format of the 8-character timestamp at the start of each line
fp = "log.txt"    # path to the log file (placeholder name)

f = open(fp, "r")
notEmpty = True
oldTime = None
while(notEmpty): #this can be executed 22,000 - 44,000 times
    l = f.readline()
    notEmpty = l != ""
    if not notEmpty:
        break
    t = datetime.datetime.strptime(l[0:8], fmt)
    if oldTime is None:
        oldTime = t
    else:
        tdelta = t - oldTime
        if tdelta.seconds > 3600: #more than 1 hour is suspicious
            print("suspicious time: " + str(t) + "\told time: " + str(oldTime))
        oldTime = t
f.close()

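Is there any way to speed up the search, for example by doing a binary search on the log file in Python?

(Note: suggestions for any search other than binary search are equally appreciated, as long as it is better than brute force.)

Edit:

I have partially implemented the solution from the answer (and fixed some bugs).

However, since the answerer proposed some "hacks" in the answer, I would also like to add some additional characteristics of the file that might be worth "hacking" for a performance gain:

If there is no suspicious time, the timestamps of the whole file, from the first entry to the last, span no more than 6 hours.

If there is no suspicious time, the difference between two successive timestamps is no more than 1 hour.

The suspicious time most likely occurs after line 20,000 and before line 30,000 (so some of the other lines can probably be skipped).

Is there a way to implement further "hacks" here?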

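It all boils down to refactoring the code so that it is more efficient (from a hardware + caching point of view).
I would consider a few design changes and optimize the code so that it does not create or call anything unnecessary while reading.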

import datetime

with open(fp, 'rb') as fh:  # fp is the path to the log file
    # Parse the first timestamp up front so the loop needs no "first line?" check
    prev_time = datetime.datetime.strptime(fh.readline()[0:5].decode('utf-8'), '%H:%M')

    for line in fh:
        if len(line) < 5: continue  # skip empty or truncated lines

        t = datetime.datetime.strptime(line[0:5].decode('utf-8'), '%H:%M')
        if (t - prev_time).total_seconds() > 3600:
            print('Suspicious time:', t, '\told time:', prev_time)
        prev_time = t

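First, instead of checking "is oldTime None?" on every iteration, we grab the first line and parse its time before entering the big for ... loop. This saves a few microseconds per line read, which adds up in the end.

Then we also use with open, simply because we don't want any file handles left open at the end. This matters if you are going through a lot of files.

We also replace the three lines of notEmpty logic with a simple "if empty, then continue".

We also shorten the time conversion so that it does not include seconds. It is a small edit, but it could save a fair amount of time in the end, because we only operate on about 2/3 of the timestamp data (5 of the 8 characters).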

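The last improvement is that we open the file as a binary object, which means we skip any automatic binary -> text (ASCII/Unicode) conversion that Python would otherwise do for us. This can have a big impact on processing speed; the only downside is that strptime needs a string-like object, so the bytes have to be decoded first.
My reckoning (I don't have a large text file to test with) is that converting 5 characters per line is faster than Python internally converting the whole file from binary data to string data. I might be wrong here.
Hope this improves things for you.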
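In the same spirit of avoiding work per line, here is a minimal sketch that skips both decode() and strptime() by converting the digits of HH:MM directly. It is not part of the answer above; the helper names minutes_of_day() and find_jumps() are made up for illustration, and whether this is actually faster on a given file would have to be measured.

```python
def minutes_of_day(line: bytes) -> int:
    """Return minutes since midnight for a line that starts with b'HH:MM'."""
    # int() accepts ASCII digit bytes directly, so no decode() is needed.
    return int(line[0:2]) * 60 + int(line[3:5])


def find_jumps(lines, threshold_minutes=60):
    """Yield (line_number, previous_minutes, current_minutes) for forward jumps."""
    prev = None
    for number, line in enumerate(lines, start=1):
        if len(line) < 5:
            continue  # skip blank or truncated lines
        cur = minutes_of_day(line)
        if prev is not None and cur - prev > threshold_minutes:
            yield number, prev, cur
        prev = cur


# Any iterable of bytes lines works, including a file opened with 'rb'.
sample = [
    b"02:38:27  0  RcvNewTxNo - 1 : MCP36 get new Tx 13920\n",
    b"09:44:54  0  RcvNewTxNo - 1 : \n",
    b"09:44:54  0  RcvNewTxNo - 1 : MCP24 get new Tx 17702\n",
]
jumps = list(find_jumps(sample))
print(jumps)  # [(2, 158, 584)]
```

Like the code above, this works at minute resolution, so a gap only counts once it exceeds 60 whole minutes.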

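Oh, and keep in mind that this only works in one direction: if the time moves backwards you will get a negative difference (that probably won't happen in a sequential time log, but you never know).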
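A small sketch of what that means in practice. The function names gap_seconds() and is_suspicious() are hypothetical; the point is only that total_seconds() keeps the sign, while the seconds attribute of a negative timedelta does not.

```python
import datetime


def gap_seconds(prev: str, cur: str, fmt: str = "%H:%M:%S") -> float:
    """Signed gap in seconds between two timestamps on the same day."""
    p = datetime.datetime.strptime(prev, fmt)
    c = datetime.datetime.strptime(cur, fmt)
    return (c - p).total_seconds()


def is_suspicious(prev: str, cur: str, limit: float = 3600) -> bool:
    """Flag jumps larger than `limit` seconds in either direction."""
    return abs(gap_seconds(prev, cur)) > limit


print(gap_seconds("02:38:27", "09:44:54"))    # 25587.0
print(gap_seconds("09:44:54", "02:38:27"))    # -25587.0
print(is_suspicious("09:44:54", "02:38:27"))  # True

# A negative timedelta is stored as days=-1 plus a positive seconds part,
# so .seconds alone is misleading for backward steps:
one_second_back = datetime.timedelta(seconds=-1)
print(one_second_back.seconds)          # 86399
print(one_second_back.total_seconds())  # -1.0
```

Taking the absolute value is one possible choice; it assumes that a large backward step should be reported as well, which the question does not state.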
Edit:
Looking for hacks
If you can make a rough estimate of the length of each line, it would actually be faster to do this:
data = fh.read(5)
t = datetime.datetime.strptime(data.decode('utf-8'), '%H:%M')
fh.seek(128, 1) # Skip 128 bytes from the current position, hopefully this is enough to find a new line.
data = fh.read(5) # again
                  # This just shows you the idea, obviously not perfect working code here hehe.

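To get the 00:00 timestamp you obviously need to work on this logic a bit more; for instance, you need to track whether you have actually passed a line ending (\r\n) and so on. Python does not know in advance how long a line is, but knowing the length roughly and skipping most of the data before looking for the \r\n marker is a huge advantage in lookup time compared with reading everything. So consider this, because skipping most of the data is always faster than pushing all of it through general-purpose functions.
Note that we are chasing microseconds here, so every crazy idea and bit of manual labor might pay off.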
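The realignment step described above could look roughly like this. It is a sketch under the assumption that every line starts with HH:MM; read_time_at() is a hypothetical helper, and io.BytesIO stands in for a file opened with 'rb'.

```python
import io


def read_time_at(fh, offset):
    """Seek to `offset`, skip to the next line start, and read its b'HH:MM'.

    Returns (line_start, stamp), or None if no full timestamp is left.
    """
    fh.seek(offset)
    if offset > 0:
        # We most likely landed mid-line; readline() consumes the rest of it
        # (it stops after '\n', so it copes with '\r\n' endings as well).
        fh.readline()
    start = fh.tell()
    stamp = fh.read(5)
    if len(stamp) < 5:
        return None  # ran off the end of the file
    return start, stamp


log = io.BytesIO(
    b"02:38:18  0  A\n"
    b"02:38:20  0  B\n"
    b"09:44:54  0  C\n"
)
print(read_time_at(log, 0))   # (0, b'02:38')
print(read_time_at(log, 20))  # (30, b'09:44')
print(read_time_at(log, 40))  # None
```

One design consequence: an offset that happens to fall exactly on a line start still skips that line, because the helper cannot tell the difference without looking at the previous byte.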
An additional hack using seek:
Assuming you know there are plenty of similar timestamps in one big chunk, you can easily skip a bunch of lines by doing:
for line in fh:
    if len(line) == 0: continue
    # Check the line

    fh.seek(56 * 10000, 1) # Seek relative to the current position. Average length of a line is 56 characters (calculated this over a few of your lines, so give or take +-10 here)
                           # And we multiply this with 10000, essentially skipping ~10k lines
If there is a big time jump here, you can do:
    if diff > 3600:
        fh.seek(fh.tell() - 5000)

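This jumps back 5000 bytes (seek counts bytes, not lines, so at ~56 bytes per line that is roughly 90 lines). Check whether the time difference is still as large as it was across the 10k-line skip; if it is, there probably really is a time gap. You can also use this to narrow down where the gap occurs (but I will leave that to you; there are cleverer ways to find it with minimal manual work and without burning processing power).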
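One way to do that narrowing is a divide-and-conquer search over byte offsets. This is not the code from this answer but a sketch that relies on the question's statement that timestamps never decrease: if the last timestamp of a byte range is at most one hour after the first, no single step inside the range can exceed one hour, so the whole range can be skipped. All function names are hypothetical, lines are assumed to start with HH:MM:SS, and io.BytesIO stands in for the real file.

```python
import io


def line_start_at(fh, offset):
    """Start offset of the first line that begins at or after `offset`."""
    if offset <= 0:
        return 0
    fh.seek(offset - 1)
    fh.readline()  # consumes up to and including the next '\n'
    return fh.tell()


def seconds_at(fh, pos):
    """Seconds since midnight for the line starting at `pos`."""
    fh.seek(pos)
    s = fh.read(8)  # b'HH:MM:SS'
    return int(s[0:2]) * 3600 + int(s[3:5]) * 60 + int(s[6:8])


def last_line_start(fh, size, window=4096):
    """Start of the last line, assuming it fits in the final `window` bytes."""
    back = min(size, window)
    fh.seek(size - back)
    tail = fh.read(back).rstrip(b"\r\n")
    return size - back + tail.rfind(b"\n") + 1


def find_jumps(fh, lo, hi, limit=3600):
    """Start offsets of lines that come more than `limit` seconds after the
    line before them. `lo` and `hi` must be line starts with lo <= hi."""
    if seconds_at(fh, hi) - seconds_at(fh, lo) <= limit:
        return []  # total span is small, so no single step can be large
    mid = line_start_at(fh, (lo + hi) // 2)
    if mid == hi:  # no line starts in the upper half of the range
        mid = line_start_at(fh, lo + 1)  # fall back to the line right after lo
        if mid == hi:
            return [hi]  # lo and hi are neighbours: this is the jump
    return find_jumps(fh, lo, mid, limit) + find_jumps(fh, mid, hi, limit)


data = (
    b"02:38:18  0  A\n"
    b"02:38:20  0  B\n"
    b"02:38:27  0  C\n"
    b"09:44:54  0  D\n"
    b"09:44:56  0  E\n"
    b"09:45:23  0  F\n"
)
log = io.BytesIO(data)
hits = find_jumps(log, 0, last_line_start(log, len(data)))
print(hits)  # [45]
log.seek(hits[0])
print(log.readline())  # b'09:44:54  0  D\n'
```

For a real file, open it with 'rb' and pass os.path.getsize(path) as the size. Ranges whose total span exceeds the limit without containing a single large step still have to be split, so this reads more than a textbook binary search would, but it touches far fewer lines than a full scan when the file has only a few jumps.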
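Essentially, the skipping approach boils down to ~4 seeks, plus manually doing what "for line in fh" does by checking for line endings and so on. Something like this: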
from functools import partial
import datetime
jump_gap = 56 * 10000 # average row length * how many rows you want to jump

def f_jump(fh, jump_size):
    fh.seek(fh.tell() + jump_size)
    # The file is opened in binary mode, so read(1) returns bytes;
    # b'' means we reached the end of the file.
    while fh.read(1) not in (b'\n', b''):
        continue
    return True

with open('log.txt', 'rb') as fh:
    prev_time = datetime.datetime.strptime(fh.read(5).decode('utf-8'), '%H:%M')
    f_jump(fh, jump_gap) # Jump to the next sample point since we got a starting time

    # Note here! We only read 5 bytes, therefore it's very important that we
    # check for new rows manually and put the pointer at the start of a new
    # line; this is what f_jump() does.
    for chunk in iter(partial(fh.read, 5), b''):
        if len(chunk) < 5:
            break # we clearly hit rock bottom

        t = datetime.datetime.strptime(chunk.decode('utf-8'), '%H:%M')
        if (t - prev_time).total_seconds() > 3600:
            print('\tSuspicious time:', t, '\told time:', prev_time, '\tat: ', fh.tell())

        prev_time = t
        f_jump(fh, jump_gap)

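You take the file position 636 and feed it to tail, like this (note that tail -c 636 prints the last 636 bytes of the file; tail -c +636 would instead start at byte 636):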
[user@firefox ~]$ tail -c 636 log.txt 
ete call GetTxData
09:45:02  0  RcvTxData - 11 : Telegram received and process completed - MCP10 Tx -9008
09:45:02  0  RcvNewTxNo - 3 : MCP10 Set receive trigger

This shows me where the problem occurred, and I can now trace back from there.

Or I could go crazy with some Linux ninja magic and do:
 x=`tail -c 636 log.txt -n 1`; grep -B 20 -A 3 "$x" log.txt

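This gives me the exact data where it happened, plus the 20 lines before it, so I can trace back a little.
Since you need the line number (probably for your boss or a colleague), you can add -n to the grep command and get line numbers that way: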
x=`tail -c 636 log.txt -n 1`; grep -B 20 -A 3 -n "$x" log.txt

[user@firefox ~]$ x=`tail -c 636 log.txt -n 1`; grep -B 20 -A 3 -n "$x" log.txt
8-02:38:24  0  RcvTxData - 11 : Telegram received and process completed - MCP35 Tx -24241
9-02:38:24  0  RcvNewTxNo - 3 : MCP35 Set receive trigger
10-02:38:27  0  RcvNewTxNo - 1 : 
11-02:38:27  0  RcvNewTxNo - 1 : MCP36 get new Tx 13920
12-09:44:54  0  RcvNewTxNo - 1 : 
13-09:44:54  0  RcvNewTxNo - 1 : MCP24 get new Tx 17702
14-09:44:54  0  RcvNewTxNo - 2 : MCP24 Read last Tx before new Tx 17702
15-09:44:56  0  RcvNewTxNo - 1 : 
16-09:45:00  0  RcvTxData - 7 :MCP24 Prepare normal TxData to DB
17-09:45:01  0  RcvTxData - 8 :MCP24 complete call GetTxData
18-09:45:02  0  RcvTxData - 11 : Telegram received and process completed - MCP10 Tx -9008
19-09:45:02  0  RcvNewTxNo - 3 : MCP10 Set receive trigger
20-09:45:04  0  RcvNewTxNo - 1 : 
21-09:45:04  0  RcvNewTxNo - 3 : MCP24 Set receive trigger
22-09:45:16  0  RcvNewTxNo - 1 : 
23-09:45:16  0  RcvNewTxNo - 1 : MCP19 get new Tx 9133
24-09:45:16  0  RcvNewTxNo - 2 : MCP19 Read last Tx before new Tx 9133
25-09:45:17  0  RcvTxData - 1 :MCP19 gwTx-9133 lastTx-9131 newTx-0
26-09:45:17  0  RcvTxData - 4 :MCP19 Adjusted newTxNo_Val-9132
27-09:45:17  0  RcvTxData - 4.1 :MCP19 FnCode PF
28:09:45:23  0  RcvTxData - 1 :MCP24 gwTx-17706 lastTx-17704 newTx-0
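If you would rather stay in Python than shell out to tail and grep, a byte offset such as the one printed by fh.tell() can be turned into a line number by counting newlines up to that offset. This is a sketch; line_number_at() is a hypothetical helper, and the temporary file only exists to make the example self-contained.

```python
import os
import tempfile


def line_number_at(path, offset, chunk_size=1 << 20):
    """Return the 1-based number of the line that contains byte `offset`."""
    newlines = 0
    remaining = offset
    with open(path, "rb") as fh:
        while remaining > 0:
            chunk = fh.read(min(chunk_size, remaining))
            if not chunk:
                break  # offset lies beyond the end of the file
            newlines += chunk.count(b"\n")
            remaining -= len(chunk)
    return newlines + 1


# Small demo file: three lines of 15 bytes each.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"02:38:27  0  A\n09:44:54  0  B\n09:45:23  0  C\n")

print(line_number_at(tmp.name, 0))   # 1
print(line_number_at(tmp.name, 15))  # 2
print(line_number_at(tmp.name, 44))  # 3
os.remove(tmp.name)
```

This still reads the file up to the offset once, but counting newlines in large chunks is much cheaper than parsing a timestamp on every line.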

Due to the nature of the seek() hack, making it fine-grained can be a bit tricky, but in this example I got the hit on line 28, which is not using