How to parse this log with Python regular expressions and export it to Excel with pandas (optional)?

Tags: python, regex, pandas, re

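I have a log file in the following format. For each line I need to capture the third column (e.g. 0102b69880c4b330), the corresponding message (e.g. DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG), and their respective counts; see the output. I thought a regular expression would be the easiest way to solve this.

Explanation:

Case 1: ID 0102b69880c4b330 appears 3 times, on lines 1, 2 and 3. So the count for the ID is 3, and the corresponding message DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG also occurs 3 times, so its count is 3.

Case 2: ID 0102b69880c4e3b2 on lines 4 and 5 has two different messages, JMS DO_METHOD TRACE LAUNCH and DO_METHOD TRACE LAUNCH. The ID count is 2, but the counts for its messages should be 1 and 1 respectively.

Case 3: ID 0102b6988000000c, from line 10 to the last line, has the message DM_WORKFLOW_E_PROCESS_AUTO_TASK. The ID count is 3 and the message count is 3. Here, however, I also need to capture the process task id and the workflow id that follow this error message.

I use [Ignore for this] in the output to indicate that I don't need the id.

Finally, I also need to keep a total count of DM_WORKFLOW_E_PROCESS_AUTO_TASK.

Below is the program I tried. My regex is wrong for everything after the ID column: I only pick up values enclosed in [], so lines without brackets are skipped. It also doesn't capture the process task id and the workflow id. Can you help me modify the code to get the correct counts, task id and workflow id?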

import re
import collections

regexp = re.compile(
        r'(?P<date>[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{6}\s*)'+
        r'(?P<un_num>[0-9]{3,5}\[[0-9]{3,5}\]\s*)'+
        r'(?P<id>[a-z0-9]{16}\s*)'+
        r'(?P<message>\[(.*?)\])'
        )
ls = ["2019-05-05T00:05:11.507245   12090[12090]    0102b69880c4b330    [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info: Attempting to status Index Agent Instance host-address_9200_IndexAgent",
      "2019-05-05T00:05:11.759829   12090[12090]    0102b69880c4b330    [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : Response from HTTP_POST command: HTTP/1.1 200 OK Status: 0 , Time Taken: 0 seconds.",
      "2019-05-05T00:05:11.759898   12090[12090]    0102b69880c4b330    [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : HTTP_POST with args -command status -docbase SubWayX -user dm_fulltext_index_user -ticket ****** -instance host-address_9200_IndexAgent -details false to Index Agent host-address_9200_IndexAgent is successful.",
      "2019-05-05T01:40:53.148751   20135[20135]    0102b69880c4e3b2    JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: Xie Xiaoke, session id: 0102b69880c4e3b2, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod",
      "2019-05-05T01:40:53.148877   20135[20135]    0102b69880c4e3b2    DO_METHOD TRACE LAUNCH: method launch: successful, user: Xie Xiaoke, session id: 0102b69880c4e3b2, method: D2LifecycleChangeStateMethod",
      "2019-05-07T05:42:21.171087   22484[22484]    0102b6988000000b    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  'Workflow Agent failed to process task 4a02b698800aad04 of workflow 4d02b6988000f709. The task is using method 'D2WFLifeCycleMethod'. Activity: 'Demote to Draft with new Version'. Check the Java Method Server log for errors.'",
      "2019-05-05T05:44:35.410674   12791[12791]    0102b6988000000c    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  'Workflow Agent failed to process task 4a02b698800a977c of workflow 4d02b698800107e9. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs.'",
      "2019-05-05T05:50:31.383668   12791[12791]    0102b6988000000c    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  'Workflow Agent failed to process task 4a02b698800a9782 of workflow 4d02b6988001081e. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs.'",
      "2019-05-05T05:53:49.978053   12791[12791]    0102b6988000000c    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  'Workflow Agent failed to process task 4a02b698800a9784 of workflow 4d02b6988001081c. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs.'"
      ]

id_counter = collections.Counter()
message_counter = collections.Counter()

print("started......!!!!!")
for i in range(len(ls)):
    x = regexp.match(ls[i])
    y = re.search(regexp, ls[i])
    if x is None or y is None:
        print("None")
        continue
    print("-----------------")
    print(y.group('date'))
    print(y.group('un_num'))
    print(y.group('id'))
    id_counter.update([y.group('id')])
    print(y.group('message'))
    message_counter.update([y.group('message')])

print("end....!!!")

print(id_counter)
print(message_counter)

def print_counts(cdict):
    for key, values in enumerate(cdict.items()):
        print(key, values)

print_counts(id_counter)
print_counts(message_counter)

Starting with the input data as text:

txt = """
2019-05-05T00:05:11.507245  12090[12090]    0102b69880c4b330    [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info: Attempting to status Index Agent Instance host-address_9200_IndexAgent
2019-05-05T00:05:11.759829  12090[12090]    0102b69880c4b330    [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : Response from HTTP_POST command: HTTP/1.1 200 OK Status: 0 , Time Taken: 0 seconds.
2019-05-05T00:05:11.759898  12090[12090]    0102b69880c4b330    [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : HTTP_POST with args -command status -docbase SubWayX -user dm_fulltext_index_user -ticket ****** -instance host-address_9200_IndexAgent -details false to Index Agent host-address_9200_IndexAgent is successful.
2019-05-05T01:40:53.148751  20135[20135]    0102b69880c4e3b2    JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: Xie Xiaoke, session id: 0102b69880c4e3b2, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod 
2019-05-05T01:40:53.148877  20135[20135]    0102b69880c4e3b2    DO_METHOD TRACE LAUNCH: method launch: successful, user: Xie Xiaoke, session id: 0102b69880c4e3b2, method: D2LifecycleChangeStateMethod
2019-05-07T05:42:21.171087  22484[22484]    0102b6988000000b    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  "Workflow Agent failed to process task 4a02b698800aad04 of workflow 4d02b6988000f709. The task is using method 'D2WFLifeCycleMethod'. Activity: 'Demote to Draft with new Version'. Check the Java Method Server log for errors."
2019-05-05T05:24:48.483966  17114[17114]    0102b69880c4fb1e    JMS DO_METHOD TRACE LAUNCH: user: dmadmin, session id: 0102b69880c4fb1e, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod, arguments:-method_verb com.emc.d2.api.methods.D2Method -class_name com.emc.d2.api.methods.D2LifecycleChangeStateMethod -__dm_docbase__ SubWayX -__dm_server_config__ host-address_SubWayX -docbase_name SubWayX -user_name dmadmin -method_return_id "0802b6988167b46e" -locale en
2019-05-05T05:24:50.362650  17114[17114]    0102b69880c4fb1e    JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: dmadmin, session id: 0102b69880c4fb1e, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod 
2019-05-05T05:24:50.362702  17114[17114]    0102b69880c4fb1e    DO_METHOD TRACE LAUNCH: method launch: successful, user: dmadmin, session id: 0102b69880c4fb1e, method: D2LifecycleChangeStateMethod
2019-05-05T05:44:35.410674  12791[12791]    0102b6988000000c    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  "Workflow Agent failed to process task 4a02b698800a977c of workflow 4d02b698800107e9. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T05:50:31.383668  12791[12791]    0102b6988000000c    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  "Workflow Agent failed to process task 4a02b698800a9782 of workflow 4d02b6988001081e. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T05:53:49.978053  12791[12791]    0102b6988000000c    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  "Workflow Agent failed to process task 4a02b698800a9784 of workflow 4d02b6988001081c. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
"""
We can do some preprocessing, first splitting the text into lines and discarding empty ones:

lines = [line for line in txt.split('\n') if line.strip()]
Then extract the chunks of data we're interested in, using only a rough and very quick split of each line:

parts = [(line[44:60], line[64:].split(':', 1))  for line in lines]
Update: since your new data isn't fixed-width, we need to preprocess it some other way, for example:

# parts = [(line[44:60], line[64:].split(':', 1))  for line in lines]
import re
lines = [re.sub(r'\s+', ' ', line) for line in lines]   # squash all multiple spaces to a single space
parts = [line.split() for line in lines]  # split on whitespace
parts = [(line[2], ' '.join(line[3:]).split(':', 1)) for line in parts]  # this is similar to the original line
Keep in mind that this part only exists to make the final processing in the InputData class below easier.
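If a regular expression is preferred for this extraction step, as in the question's attempt, the following is a minimal sketch rather than part of the original answer. It assumes every line has the layout timestamp, pid[pid], 16-hex-digit id, message, and that the message code is either bracketed or runs up to the first colon. The names LINE_RE, TASK_RE and parse_line are hypothetical.

```python
import re

# Assumed layout: timestamp, pid[pid], 16-hex-digit id, then the message.
# The message code is either bracketed ([DM_...]) or free text up to the first colon.
LINE_RE = re.compile(
    r'^(?P<date>\S+)\s+'
    r'(?P<pid>\d+\[\d+\])\s+'
    r'(?P<id>[0-9a-f]{16})\s+'
    r'(?:\[(?P<code>[^\]]+)\]|(?P<text>[^:]+))'
    r'(?P<rest>.*)$'
)
# Task and workflow ids as they appear in the DM_WORKFLOW_E_PROCESS_AUTO_TASK lines.
TASK_RE = re.compile(
    r'task (?P<task>[0-9a-f]{16}) of workflow (?P<workflow>[0-9a-f]{16})'
)


def parse_line(line):
    """Return (idtag, msg, task_id, workflow_id), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    msg = (m.group('code') or m.group('text')).strip()
    t = TASK_RE.search(m.group('rest'))
    task, workflow = (t.group('task'), t.group('workflow')) if t else (None, None)
    return m.group('id'), msg, task, workflow
```

Because the pid column is matched with \d+\[\d+\], this variant does not depend on the column being 3, 4 or 5 digits wide.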

Then we create a data structure for the input data we're interested in, which splits our preprocessed data into its parts:

class InputData(object):
    def __init__(self, idtag, parts):
        # Python 3 no longer allows tuple parameters such as (msg, details)
        # in a function signature (PEP 3113), so unpack the pair explicitly.
        msg, details = parts
        self.idtag = idtag
        self.error_task = None
        self.error_workflow = None
        msg = msg.strip()
        if msg.endswith('] info'):
            self.msg = msg[1:-len('] info')]
        elif msg.endswith('error'):
            self.msg = msg[1:-len(']error')]
            self.error_task = details.split(' task ', 1)[1].split(' ', 1)[0]
            self.error_workflow = details.split(' workflow ', 1)[1].split('.', 1)[0]
        else:
            self.msg = msg

    def __repr__(self):
        return repr(self.__dict__)  # this is a great trick for making debugging easier
In Python 2 the pair could instead be unpacked directly in the signature:

def __init__(self, idtag, (msg, details)):
Now we can apply this class to the preprocessed input:

input_data = [InputData(*part) for part in parts]
If we print the data so far:

for d in input_data:
    print(d)
the result is:

{'idtag': '0102b69880c4b330', 'error_task': None, 'error_workflow': None, 'msg': 'DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG'}
{'idtag': '0102b69880c4b330', 'error_task': None, 'error_workflow': None, 'msg': 'DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG'}
{'idtag': '0102b69880c4b330', 'error_task': None, 'error_workflow': None, 'msg': 'DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG'}
{'idtag': '0102b69880c4e3b2', 'error_task': None, 'error_workflow': None, 'msg': 'JMS DO_METHOD TRACE LAUNCH'}
{'idtag': '0102b69880c4e3b2', 'error_task': None, 'error_workflow': None, 'msg': 'DO_METHOD TRACE LAUNCH'}
{'idtag': '0102b6988000000b', 'error_task': '4a02b698800aad04', 'error_workflow': '4d02b6988000f709', 'msg': 'DM_WORKFLOW_E_PROCESS_AUTO_TASK'}
...
Now we create a class that represents the data we want in the output:

from collections import defaultdict

class OutputData(object):
    def __init__(self):   # I'm using this class in a defaultdict, so the __init__ method can't take any arguments
        self.idtag = None
        self.idtag_count = 0
        self.messages = defaultdict(int)
        self.errors = []
        self.workflows = []

    def add(self, indata):
        "Adds indata to this object."
        self.idtag = indata.idtag
        self.idtag_count += 1
        self.messages[indata.msg] += 1        
        if indata.error_task:
            self.errors.append(indata.error_task)
            self.workflows.append(indata.error_workflow)
and feed the input data into it:

output_data = defaultdict(OutputData)

for indata in input_data:
    output_data[indata.idtag].add(indata)
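The question also asks for an overall total per message (for example DM_WORKFLOW_E_PROCESS_AUTO_TASK) across all ids, which the per-id aggregation does not give directly. A minimal sketch with collections.Counter follows; records is a hypothetical stand-in holding (idtag, msg) pairs taken from the sample log in the question.

```python
from collections import Counter

# Hypothetical stand-in for the parsed input: (idtag, msg) pairs from the sample log.
records = [
    ('0102b69880c4b330', 'DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG'),
    ('0102b69880c4b330', 'DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG'),
    ('0102b69880c4b330', 'DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG'),
    ('0102b69880c4e3b2', 'JMS DO_METHOD TRACE LAUNCH'),
    ('0102b69880c4e3b2', 'DO_METHOD TRACE LAUNCH'),
    ('0102b6988000000b', 'DM_WORKFLOW_E_PROCESS_AUTO_TASK'),
    ('0102b6988000000c', 'DM_WORKFLOW_E_PROCESS_AUTO_TASK'),
    ('0102b6988000000c', 'DM_WORKFLOW_E_PROCESS_AUTO_TASK'),
    ('0102b6988000000c', 'DM_WORKFLOW_E_PROCESS_AUTO_TASK'),
]

id_counts = Counter(idtag for idtag, _ in records)   # lines per id
msg_totals = Counter(msg for _, msg in records)      # lines per message, across all ids
pair_counts = Counter(records)                       # lines per (id, message) pair

print(msg_totals['DM_WORKFLOW_E_PROCESS_AUTO_TASK'])  # 4 for this sample
```

With the objects from the answer, the same totals would come from Counter(d.msg for d in input_data).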
Finally, we can print the output data in the desired format:

fmt = '%-20s %-6s %-55s %-15s %-60s %s'

print(fmt % ('ID:', 'Count:', 'Message:', 'msg counts', 'taskid', 'workflowid'))
for outdata in output_data.values():
    print(fmt % (
        outdata.idtag,
        outdata.idtag_count,
        ', '.join(outdata.messages.keys()),
        ', '.join(str(outdata.messages[k]) for k in outdata.messages.keys()),
        ', '.join(outdata.errors),
        ', '.join(outdata.workflows)
    ))
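For the optional Excel export mentioned in the title, one possibility is to flatten the per-id summary into rows and hand them to pandas. This is only a sketch under assumptions: the summary dict shape and the names to_rows and export_to_excel are hypothetical, and writing .xlsx files requires pandas together with an Excel engine such as openpyxl.

```python
def to_rows(summary):
    """Flatten {idtag: {'count', 'messages', 'tasks', 'workflows'}} into one dict per id."""
    rows = []
    for idtag, info in summary.items():
        rows.append({
            'ID': idtag,
            'Count': info['count'],
            'Message': ', '.join(info['messages']),
            'Message counts': ', '.join(str(n) for n in info['messages'].values()),
            'Task ID': ', '.join(info['tasks']),
            'Workflow ID': ', '.join(info['workflows']),
        })
    return rows


def export_to_excel(rows, path):
    """Write the rows to an .xlsx file (needs pandas and an Excel engine)."""
    import pandas as pd  # imported lazily so to_rows() works without pandas
    pd.DataFrame(rows).to_excel(path, index=False)


# Hypothetical summary in the assumed shape, filled with values from the sample log.
summary = {
    '0102b69880c4e3b2': {
        'count': 2,
        'messages': {'JMS DO_METHOD TRACE LAUNCH': 1, 'DO_METHOD TRACE LAUNCH': 1},
        'tasks': [],
        'workflows': [],
    },
    '0102b6988000000b': {
        'count': 1,
        'messages': {'DM_WORKFLOW_E_PROCESS_AUTO_TASK': 1},
        'tasks': ['4a02b698800aad04'],
        'workflows': ['4d02b6988000f709'],
    },
}

rows = to_rows(summary)
# export_to_excel(rows, 'summary.xlsx')  # uncomment to write the file
```

Keeping the row-building step separate from the pandas call means the table can be inspected or tested without touching the file system.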

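This kind of structure (preprocess the text, extract the input data of interest, convert the input data to output data, and finally serialize/format the output data) works well for problems like this one, and it makes future debugging and modification much easier.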

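Your data looks very fixed-width. Why use a regular expression?

OK, then which approach should I use? Can you help me?

Usually you would split the lines, parse the parts, create an internal data structure, and then output it in the desired format. I need to walk the dog, so I don't have time to write you a program right now, sorry.

Awesome. Thank you so much!

The code below gives a syntax error in Python 3: class InputData(object): def __init__(self, idtag, (msg, details)): when we call input_data = [InputData(*part) for part in parts]

I have a question: I forgot to mention in the input that the second column now also contains 4-digit numbers, e.g. 2591[2591]. How should I modify my code to handle 3-, 4- and 5-digit numbers? In this code, how can 44, 60 and 64 be turned into variables by checking whether the number has 3, 4 or 5 digits? parts = [(line[44:60], line[64:].split(':', 1)) for line in lines]

As I have 50000 lines, I used lines = [], with open('sub.txt', 'r') as fp: for line in fp: lines.append(line), which gives me unknown results. Can you tell me what went wrong?

I've updated the answer for your first comment; the second comment sounds like a new question. Please ask it as a new question.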