Extracting parts of a log in Python to import into Excel


I'm trying to extract parts of a log (a txt file) using regular expressions, but I need some help. Basically, the log looks like this:

Tue Feb 24 17:51:10.835 SRV02    NOTICE  Event Loop - noop
Tue Feb 24 17:51:10.835 SRV02    NOTICE     Exponential histogram:
Tue Feb 24 17:51:10.835 SRV02    NOTICE     hist[ 0]: <      0.001: 728941854
Tue Feb 24 17:51:10.835 SRV02    NOTICE  Event Loop - noop: samples: 728941854; avg: 0.00; min: 0.00; max: 0.00
Tue Feb 24 17:51:10.835 SRV02    NOTICE  Data Quality Monitor Thread Processing Time
Tue Feb 24 17:51:10.835 SRV02    NOTICE     Exponential histogram:
Tue Feb 24 17:51:10.835 SRV02    NOTICE     hist[ 4]: <      0.016:         3
Tue Feb 24 17:51:10.835 SRV02    NOTICE     hist[ 5]: <      0.032:        23
Tue Feb 24 17:51:10.835 SRV02    NOTICE     hist[ 6]: <      0.064:        14
Tue Feb 24 17:51:10.835 SRV02    NOTICE     hist[ 7]: <      0.128:         4
Tue Feb 24 17:51:10.835 SRV02    NOTICE     hist[ 8]: <      0.256:         6
Tue Feb 24 17:51:10.835 SRV02    NOTICE     hist[ 9]: <      0.512:         1
Tue Feb 24 17:51:10.835 SRV02    NOTICE     hist[10]: <      1.024:         2
Tue Feb 24 17:51:10.835 SRV02    NOTICE  Data Quality Monitor Thread Processing Time: samples: 53; avg: 0.08; min: 0.01; max: 0.67
Tue Feb 24 17:51:10.835 SRV02    NOTICE  Client Hugepage Memory:   649/4096 MB 
Tue Feb 24 17:51:10.836 SRV02    NOTICE  DQM: Num R: 0 RD: 0 ED: 0 W: 0 WH: 0 Q: 0 D: 0 DF: 0
Tue Feb 24 17:51:10.836 SRV02    NOTICE  Num G: 0 M: 0 S: 0 D: 0 U: 0 R: 0 N: 0
Tue Feb 24 17:51:10.836 SRV02    NOTICE  num_template_allocs                       =          4
Tue Feb 24 17:51:10.836 SRV02    NOTICE  num_template_frees                        =          0
Tue Feb 24 17:51:10.836 SRV02    NOTICE  num_internal_book_allocs                  =         24
So, from the example above, I need to extract and rearrange the data like this, where "Event Loop - noop" and "Data Quality Monitor Thread Processing Time" have to be repeated on every line so that each histogram can be identified:

Event Loop - noop;hist[ 0];0.001;728941854
Event Loop - noop;samples;728941854;avg;0.00;min;0.00;max;0.00
Data Quality Monitor Thread Processing Time;hist[ 4];0.016;3
Data Quality Monitor Thread Processing Time;hist[ 5];0.032;23
Data Quality Monitor Thread Processing Time;hist[ 6];0.064;14
(...)
Data Quality Monitor Thread Processing Time;hist[ 10];1.024;2
Data Quality Monitor Thread Processing Time;samples;53;avg;0.08;min;0.01;max;0.67

Can anyone help me with how to do this? Thank you all!

Your example output references data that isn't in the example input; in particular, your real data seems to have more lines under the "Data Quality Monitor Thread Processing Time" header than the sample shows. It seems you want to keep track of the most recent header line for each histogram.

In any case, I think it's easier to extract the data with a couple of separate regex statements rather than trying to build one catch-all expression:

import re

with open('test.log') as f:        # read the whole log file into one string
    log_text = f.read()
hists = re.findall(r'(hist\[\s*\d+\]).*?(\d+\.\d+).*?(\d+)', log_text)
sample_avg_etc = re.findall(r'(samples): (\d+); (avg): (\d+\.\d+); (min): (\d+\.\d+); (max): (\d+\.\d+)', log_text)
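
For the sample log above, each findall call returns one tuple per matching line, which makes it easy to sanity-check the two patterns (the values in the comments are what the sample shown here should yield):

print(hists[:2])
# [('hist[ 0]', '0.001', '728941854'), ('hist[ 4]', '0.016', '3')]
print(sample_avg_etc[0])
# ('samples', '728941854', 'avg', '0.00', 'min', '0.00', 'max', '0.00')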
If you need to keep the local headers as they appear in your example output, though, I don't think you want regex alone. Instead, just write a small parser to extract the data.

You can start by stripping the "Tue Feb 24 17:51:10.835 SRV02    NOTICE" prefix from each line, then target the data line by line while keeping track of the last header seen. See the comments; the following returns what you listed above:

import re

def parse(data):
    lines = data.split('\n')  # get the lines by splitting on the newline char
    lines = [line[len("Tue Feb 24 17:51:10.835 SRV02    NOTICE  "):]  for  line in lines]  # remove the number of characters equal to the logging info
    out = []
    header = ''
    for line in lines:
        if line.startswith('   '):
            if line.strip().startswith('hist'):
                out.append(header + ";" + extract_hist_data(line))  # outsource the specific extracting to a function for ease of readability
        else:                      # header/samples line
            if all(i in line for i in ("samples", "avg", "min", "max")):  # if the line contains all these keywords
                out.append(header + ";" + extract_stat_data(line))  # outsource the specific extracting to a function for ease of readability
            else:  # Treat as a header
                header = line
    return '\n'.join(out)

def extract_hist_data(line):
    data = re.findall(r'(hist\[\s*?\d+\]).*?(\d+\.\d+).*?(\d+)',line)
    if len(data) > 0:
        data = data[0]
    else:
        return ""
    return ';'.join(data)

def extract_stat_data(line):
    data = re.findall(r'(samples).*?(\d+).*?(avg).*?(\d+\.\d+).*?(min).*?(\d+\.\d+).*?(max).*?(\d+\.\d+)',line)
    if len(data) > 0:
        data = data[0]
    else:
        return ""
    return ';'.join(data)

def parse_log_file(log_file_path):
    with open(log_file_path, 'r') as f:
        content = f.read()  # read the whole file into a single string
    return parse(content)

print(parse_log_file('test.log'))
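
Since the goal is to get the data into Excel, the semicolon-separated lines returned by parse_log_file can be written straight to a .csv file, which Excel opens using ';' as the delimiter. A minimal sketch (the file names here are just examples):

def write_csv(log_file_path, csv_path):
    # write the parsed, semicolon-separated records to a file Excel can import
    with open(csv_path, 'w') as f:
        f.write(parse_log_file(log_file_path) + '\n')

write_csv('test.log', 'histograms.csv')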

Hey Joe, thanks a lot for the great script. I edited the question to make it easier to understand. Your script is very close to the ideal solution; the only thing missing is repeating the header ("Event Loop - noop" and "Data Quality Monitor Thread Processing Time") on every line, so I can tell which histogram the data belongs to. When I run the script I get output like: hist[ 4]: < 0.016: 3; samples;53;avg;0.08;min;0.01;max;0.67; but what I need is something like: Data Quality Monitor Thread Processing Time;hist[ 4];0.016;3

@user179589 Did you run the second block of code? It keeps the correct header. If you're saying my answer is missing that, I don't think you used the full answer.

Oh, you're right, I missed the last part. Thank you very much, problem solved!