Parsing groups of related lines in a text file with Python regular expressions


I have a file containing the following text:

    $ more audit.log
    2018-01-31 15:34:08 GMT:10.34.160.60(63788):agent3@pem:[31884]00000:LOG: statement: DROP TABLE tmp_zombies
    2018-01-31 15:58:52 GMT:127.0.0.1(45050):agent1@pem:[13182]00000:LOG: statement: CREATE TEMP TABLE tmp_zombies(jagpid int4)
    2018-01-31 15:58:52 GMT:127.0.0.1(45050):agent1@pem:[13182]00000:LOG: statement: DROP TABLE tmp_zombies
    2018-01-31 16:24:00 GMT:10.34.160.55(57199):agent8@pem:[27888]00000:LOG: statement: CREATE TEMP TABLE tmp_zombies(jagpid int4)
    2018-01-31 16:24:00 GMT:10.34.160.55(57199):agent8@pem:[27888]00000:LOG: statement: DROP TABLE tmp_zombies
    2018-01-31 21:08:47 GMT:[local]:pgsql@p106:[26349]00000:LOG: statement: create table global_pg_audit
    (
    rolename text not null,
    stmt_timestamp timestamp not null,
    source_ip text,
    target_ip text,
    dbname text,
    pid text,
    statement_type text,
    statement text
    );
    2018-01-31 15:34:08 GMT:10.34.160.60(63788):agent3@pem:[31884]00000:LOG: statement: DROP TABLE tmp_zombies

When I run this Python code:

    import re

    fullpathname = './audit.log'
    regex_pattern = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(.*?)$', re.MULTILINE | re.DOTALL)

    with open(fullpathname, 'r') as f:
        log_entries = regex_pattern.findall(f.read())

    counter = 0
    for entry in log_entries:
        print('%d=>[' % counter, entry, ']')
        counter = counter + 1

the result is:

    0=>[ ('2018-01-31 15:34:08', ' GMT:10.34.160.60(63788):agent3@pem:[31884]00000:LOG: statement: DROP TABLE tmp_zombies') ]
    1=>[ ('2018-01-31 15:58:52', ' GMT:127.0.0.1(45050):agent1@pem:[13182]00000:LOG: statement: CREATE TEMP TABLE tmp_zombies(jagpid int4)') ]
    2=>[ ('2018-01-31 15:58:52', ' GMT:127.0.0.1(45050):agent1@pem:[13182]00000:LOG: statement: DROP TABLE tmp_zombies') ]
    3=>[ ('2018-01-31 16:24:00', ' GMT:10.34.160.55(57199):agent8@pem:[27888]00000:LOG: statement: CREATE TEMP TABLE tmp_zombies(jagpid int4)') ]
    4=>[ ('2018-01-31 16:24:00', ' GMT:10.34.160.55(57199):agent8@pem:[27888]00000:LOG: statement: DROP TABLE tmp_zombies') ]
    5=>[ ('2018-01-31 21:08:47', ' GMT:[local]:pgsql@p106:[26349]00000:LOG: statement: create table global_pg_audit ') ]
    6=>[ ('2018-01-31 15:34:08', ' GMT:10.34.160.60(63788):agent3@pem:[31884]00000:LOG: statement: DROP TABLE tmp_zombies') ]
    7=>[ ('2018-01-31 15:58:52', ' GMT:127.0.0.1(45050):agent1@pem:[13182]00000:LOG: statement: CREATE TEMP TABLE tmp_zombies(jagpid int4)') ]

Note that in entry 5 of the output, the code did not capture the whole statement, which should be:

    create table global_pg_audit
    (
    rolename text not null,
    stmt_timestamp timestamp not null,
    source_ip text,
    target_ip text,
    dbname text,
    pid text,
    statement_type text,
    statement text
    );

What is wrong with the code?


Thanks very much.

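Your regular expression is anchored to the end of a line:

    ^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(.*?)$

Because multiline mode is enabled, $ matches at every line break, and the lazy (.*?) stops at the first one. That is why the match ends right after global_pg_audit.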
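To see the effect in isolation, here is a minimal sketch that runs the question's pattern on a small in-memory string instead of audit.log. The `sample` text is a shortened, hypothetical stand-in for the real log, not data from the question.

```python
import re

# Hypothetical two-entry sample; the second entry spans several lines.
sample = (
    "2018-01-31 15:34:08 GMT: statement: DROP TABLE t\n"
    "2018-01-31 21:08:47 GMT: statement: create table g\n"
    "(\n"
    "x text\n"
    ");\n"
)

# Same pattern as in the question: with re.MULTILINE, $ matches before
# every newline, so the lazy (.*?) ends at the first line break.
pattern = re.compile(
    r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(.*?)$',
    re.MULTILINE | re.DOTALL,
)

entries = pattern.findall(sample)
for stamp, rest in entries:
    print(repr(stamp), repr(rest))
```

The second tuple stops at `create table g`; the continuation lines are never captured, even though re.DOTALL would let `.` cross newlines.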


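You want to match up to the next line that begins with a date. You can do this with a lookahead:

    ^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(.*?)(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\Z)

The alternation |\Z lets the regex match the last entry even though no date follows it.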
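The lookahead approach can be sketched on a small in-memory string as follows. The `sample` text and the `DATE` helper name are hypothetical. The `^` inside the lookahead is also an addition that is not in the pattern above; it is assumed here so that a timestamp appearing in the middle of a statement does not end an entry early.

```python
import re

# Hypothetical two-entry sample; the second entry spans several lines.
sample = (
    "2018-01-31 15:34:08 GMT: statement: DROP TABLE t\n"
    "2018-01-31 21:08:47 GMT: statement: create table g\n"
    "(\n"
    "x text\n"
    ");\n"
)

DATE = r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}'

# The lazy group now runs until the next line that starts with a
# timestamp, or until the end of the string (\Z) for the last entry.
pattern = re.compile(
    r'^(' + DATE + r')(.*?)(?=^' + DATE + r'|\Z)',
    re.MULTILINE | re.DOTALL,
)

# Strip the surrounding whitespace and trailing newline from each body.
entries = [(stamp, rest.strip()) for stamp, rest in pattern.findall(sample)]
for stamp, rest in entries:
    print(stamp, '=>', repr(rest))
```

Each tuple now holds the timestamp and the complete entry, including the continuation lines of a multi-line statement.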



Thanks. That works well.