Python: retrieve data until the next regex pattern match


I have retrieved error log data from a server in the following format:

Text file:

2018-01-09 04:50:25,226 [18] INFO messages starts here line1 \n   
    line2 above error continued in next line  
2018-01-09 04:50:29,226 [18] ERROR messages starts here line1 \n  
    line2 above error continued in next line  
2018-01-09 05:50:29,226 [18] ERROR messages starts here line1 \n 
    line2 above error continued in next line

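I need to retrieve the error/informational messages together with their date-time stamps.

I have written the Python code below. It works fine when an error message fits on a single line, but not when the same error is logged across multiple lines (in that case it returns only the first line, whereas I also need the following lines if they belong to the same error).

Any solution or idea would be helpful.

Here is my code: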

import re

with open('text.txt', 'r', encoding="Latin-1") as f:
    strr = re.findall(r'(\d{4}-\d{1,2}-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2})(\,\d{1,3}\s\[\d{1,3}\]\s)(INFO|ERROR)(.*)$', f.read(), re.MULTILINE)
print(strr)


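The code above gives this output:

[('2018-01-09 04:50:25', ',226 [18] ', 'INFO', ' messages starts here line1'), ('2018-01-09 04:50:29', ',226 [18] ', 'ERROR', ' messages starts here line1'), ('2018-01-09 05:50:29', ',226 [18] ', 'ERROR', ' messages starts here line1')]

The output I expect is:

[('2018-01-09 04:50:25', ',226 [18] ', 'INFO', ' messages starts here line1 line2 above error continued in next line'), ('2018-01-09 04:50:29', ',226 [18] ', 'ERROR', ' messages starts here line1 line2 above error continued in next line'), ('2018-01-09 05:50:29', ',226 [18] ', 'ERROR', ' messages starts here line1 line2 above error continued in next line')]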

Add \n to the regular expression:

(\d{4}-\d{1,2}-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2})(\,\d{1,3}\s\[\d{1,3}\]\s)(INFO|ERROR)(.*\n.*)
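One caveat with this pattern is that the last group always spans exactly one newline, so it expects every record to be two lines long. The sketch below uses a made-up sample (not the real log) in which the first record is a single line, to show what can happen in that case:

```python
import re

# The pattern suggested above: the message group spans exactly one newline.
pattern = re.compile(
    r'(\d{4}-\d{1,2}-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2})'
    r'(\,\d{1,3}\s\[\d{1,3}\]\s)(INFO|ERROR)(.*\n.*)'
)

# Hypothetical sample: the first record is a single line, the second spans two.
sample = (
    "2018-01-09 04:50:25,226 [18] INFO single line message\n"
    "2018-01-09 04:50:29,226 [18] ERROR messages starts here line1\n"
    "    line2 above error continued in next line\n"
)

records = pattern.findall(sample)
for record in records:
    print(record)
```

Here the INFO record swallows the header line of the ERROR record, so only one tuple comes back. If every record in the file really is two lines long, the pattern is fine; otherwise a lookahead or a line-by-line approach is probably safer.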

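You can use a lookahead expression and search for the match between one timestamp structure (inclusive) and the next one (exclusive). In your case, every log record starts with such a structure. You also need to remove the `$` anchor, because with re.MULTILINE it matches at every newline.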
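A minimal sketch of the lookahead idea, assuming every record starts at the beginning of a line with a timestamp of the form shown in the question. The names `HEADER`, `record_re` and `sample` are made up for illustration, and the sample is not the real log:

```python
import re

# Assumed shape of a record header: timestamp, ",millis [thread] ", then the level.
HEADER = (r'(\d{4}-\d{1,2}-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2})'
          r'(,\d{1,3}\s\[\d{1,3}\]\s)'
          r'(INFO|ERROR)')

# The message is matched lazily (DOTALL lets "." cross newlines) and stops
# right before the next line that begins with a timestamp, or at the end
# of the text. The lookahead does not consume the next header.
record_re = re.compile(
    r'^' + HEADER + r'(.*?)(?=^\d{4}-\d{1,2}-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2},|\Z)',
    re.MULTILINE | re.DOTALL,
)

sample = (
    "2018-01-09 04:50:25,226 [18] INFO single line message\n"
    "2018-01-09 04:50:29,226 [18] ERROR messages starts here line1\n"
    "    line2 above error continued in next line\n"
    "    line3 still the same error\n"
)

# Drop the ",226 [18] " group and trim surrounding whitespace from the message.
records = [(ts, level, msg.strip())
           for ts, _, level, msg in record_re.findall(sample)]
for record in records:
    print(record)
```

Because the lookahead only peeks at the next header, a record can span one line or many. A continuation line that itself begins with a timestamp followed by a comma would be treated as a new record, so this relies on the assumption stated above.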

Edit

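You can do even better. Once you have found a timestamp structure, go through the file line by line. Keep collecting lines into the current log record until you see a new timestamp structure. Join the lines that belong to one record and run the regex on the result. Then move on to the next record.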
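The steps in this edit could be sketched roughly as follows. `HEADER_START`, `RECORD`, `iter_records` and `parse` are hypothetical names, and the sketch assumes every record starts with a timestamp followed by a comma:

```python
import re

# Assumed: every record begins with "YYYY-MM-DD HH:MM:SS," at the start of a line.
HEADER_START = re.compile(r'\d{4}-\d{1,2}-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2},')

# Applied to one joined record; DOTALL lets the message group span newlines.
RECORD = re.compile(
    r'(\d{4}-\d{1,2}-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2})'
    r'(,\d{1,3}\s\[\d{1,3}\]\s)(INFO|ERROR)\s?(.*)',
    re.DOTALL,
)


def iter_records(lines):
    """Group physical lines into logical records, one joined string each."""
    current = []
    for line in lines:
        line = line.rstrip('\n')
        if HEADER_START.match(line):
            if current:
                yield '\n'.join(current)
            current = [line]          # a new record starts here
        elif current:
            current.append(line)      # continuation of the current record
        # lines before the first header are skipped
    if current:
        yield '\n'.join(current)


def parse(lines):
    """Run the regex on each joined record and yield the captured groups."""
    for record in iter_records(lines):
        match = RECORD.match(record)
        if match:
            yield match.groups()


lines = [
    "some preamble without a timestamp",
    "2018-01-09 04:50:25,226 [18] INFO single line message",
    "2018-01-09 04:50:29,226 [18] ERROR messages starts here line1",
    "    line2 above error continued in next line",
]
parsed = list(parse(lines))
for row in parsed:
    print(row)
```

Since `iter_records` accepts any iterable of lines, an open file object can be passed in directly, which avoids reading the whole log into memory.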

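This may not be as neat as you would like, but nothing stops you from going through the text line by line and accumulating the error information as you go: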

import re

example = '''2018-01-09 04:50:25,226 [18] INFO messages starts here line1
    line2 above error continued in next line
2018-01-09 04:50:29,226 [18] ERROR messages starts here line1
    line2 above error continued in next line
2018-01-09 05:50:29,226 [18] ERROR messages starts here line1
    line2 above error continued in next line  '''

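# Compile once; no flags are needed because each line is matched on its own
header = re.compile(r'(\d{4}-\d{1,2}-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2})'
                    r'(\,\d{1,3}\s\[\d{1,3}\]\s)(INFO|ERROR)(.*)')

output = []

for line in example.splitlines():
    match = header.match(line)
    if match:
        # A timestamped line starts a new record
        output.append(list(match.groups()))
    # Check that output already exists - in case of headers
    elif output:
        # Continuation line: attach it to the most recent record
        output[-1].append(line)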

This returns:

[['2018-01-09 04:50:25', ',226 [18] ', 'INFO', ' messages starts here line1', '    line2 above error continued in next line'], ['2018-01-09 04:50:29', ',226 [18] ', 'ERROR', ' messages starts here line1', '    line2 above error continued in next line'], ['2018-01-09 05:50:29', ',226 [18] ', 'ERROR', ' messages starts here line1', '    line2 above error continued in next line  ']]

Python code using a single regular expression:


import re

# `text` is assumed to hold the full contents of the log file as one string
matches = re.findall(r'(\d{4}(?:-\d{2}){2}\s\d{2}(?::\d{2}){2})(,\d+[^\]]+\])\s(INFO|ERROR)\s([\S\s]+?)(?=\r?\n\d{4}(?:-\d{2}){2}|$)', text)

Output:

[('2018-01-09 04:50:25', ',226 [18]', 'INFO', 'messages starts here line1\nline2 above error continued in next line'), ('2018-01-09 04:50:29', ',226 [18]', 'ERROR', 'messages starts here line1\nline2 above error continued in next line'), ('2018-01-09 05:50:29', ',226 [18]', 'ERROR', 'messages starts here line1\nline2 above error continued in next line')]
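To make the logic of that pattern easier to follow, here is the same regular expression written with re.VERBOSE and one comment per part. The name `LOG_RECORD` and the sample `text` are made up for illustration; the pattern itself is unchanged:

```python
import re

LOG_RECORD = re.compile(r'''
    (\d{4}(?:-\d{2}){2}\s\d{2}(?::\d{2}){2})  # group 1: date and time, YYYY-MM-DD HH:MM:SS
    (,\d+[^\]]+\])                             # group 2: comma, millis, then everything up to "]"
    \s(INFO|ERROR)\s                           # group 3: the level, with whitespace on both sides
    ([\S\s]+?)                                 # group 4: the message; [\S\s] also matches newlines,
                                               #          and +? keeps the match as short as possible
    (?=\r?\n\d{4}(?:-\d{2}){2}|$)              # lookahead: stop before a newline followed by a date,
                                               #            or at the end of the text
''', re.VERBOSE)

# Hypothetical sample: a two-line record followed by a single-line record.
text = (
    "2018-01-09 04:50:25,226 [18] INFO messages starts here line1\n"
    "line2 above error continued in next line\n"
    "2018-01-09 04:50:29,226 [18] ERROR single line message"
)

matches = LOG_RECORD.findall(text)
for m in matches:
    print(m)
```

The lookahead is what lets a record span any number of lines: it checks for the start of the next record without consuming it, so the next match can begin there. One limitation is that a continuation line that itself starts with a date would be taken as the start of a new record.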

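I don't think you need a regex for this task. Just join every odd line with the next one.

@Kasramvd I need to know where the next error starts, which is why I used a regex pattern. Each error can span multiple lines, not necessarily just two.

You are explicitly matching up to the end of one line (`$`) in your regex. If you don't know how many lines a single "log line" can be split across, or if several processes/threads write to the log at the same time, this may be too hard to do with a regex alone. If you control whatever generates the log, it would be better to change it so that it does not produce log messages containing newlines, because those cause problems for all kinds of operations you might want to run on the log file.

This will fail if there is a single-line error message; you are assuming that all log entries are two lines long.

Yes, you are right. In this example there are always two lines...

Or if some messages are split over more than two lines. Perhaps the OP can clarify the scope of the problem.

@ElodieDellier Thanks, but it does not give the expected result. It only shows the result up to \n.

@Sriharsha Replace (.*) with (.*\n.*).

Sorry, I was not clear. The error message here may contain a date-time stamp, but I don't believe it will contain an exactly similar structure such as:

2018-01-09 04:50:29,226 [18] ERROR

Thanks a lot :-) Works fine for me!!! Could you briefly explain the logic behind the regex you wrote, so that it is helpful to me and others?

@Sriharsha I'm glad I could help.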