Python: How would I parse the following log?

I need to parse a log in the following format:

===== Item 5483/14800  =====
This is the item title
Info: some note
===== Item 5483/14800 (Update 1/3) =====
This is the item title
Info: some other note
===== Item 5483/14800 (Update 2/3) =====
This is the item title
Info: some more notes
===== Item 5483/14800 (Update 3/3) =====
This is the item title
Info: some other note
Test finished. Result Foo. Time 12 secunds.
Stats: CPU 0.5 MEM 5.3
===== Item 5484/14800  =====
This is this items title
Info: some note
Test finished. Result Bar. Time 4 secunds.
Stats: CPU 0.9 MEM 4.7
===== Item 5485/14800  =====
This is the title of this item
Info: some note
Test finished. Result FooBar. Time 7 secunds.
Stats: CPU 2.5 MEM 2.8

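I only need to extract each item's title (the line right after ===== Item 5484/14800 =====) and the result.
So I need to keep only the line with the item title and the result for that title, and discard everything else.
The issue is that sometimes an item has notes (maximum 3) and sometimes the result is displayed without additional notes, which makes it tricky.
Any help would be appreciated. I'm writing the parser in Python, but I don't need actual code, just some pointers on how I could achieve this.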

Later edit: the result I'm looking for is to discard everything else and get something like:

('This is the item title','Foo')
then
('This is this items title','Bar')

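Edit: now that I see the result you're looking for, I've added more.

Parsing isn't done with regular expressions. If you have reasonably well-structured text (which it looks like you do), you can use faster tests (e.g. line.startswith() or similar). A list of dictionaries seems to be a suitable data type for such key-value pairs. I'm not sure what else to tell you. This seems pretty trivial.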
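As a rough illustration of the list-of-dictionaries idea, here is a minimal sketch that turns each "Stats:" line into one dictionary using only startswith() and split(). The names parse_stats and collect_stats are made up for this example, and it assumes the name/value pairs always alternate as they do in the sample log.

```python
def parse_stats(line):
    """Turn 'Stats: CPU 0.5 MEM 5.3' into {'CPU': 0.5, 'MEM': 5.3}."""
    fields = line[len("Stats:"):].split()
    # Assumption: fields alternate between a name and its numeric value.
    return {name: float(value) for name, value in zip(fields[::2], fields[1::2])}


def collect_stats(lines):
    """Return one dictionary per line that starts with 'Stats:'."""
    return [parse_stats(line) for line in lines if line.startswith("Stats:")]


log = """===== Item 5484/14800  =====
This is this items title
Info: some note
Test finished. Result Bar. Time 4 secunds.
Stats: CPU 0.9 MEM 4.7
===== Item 5485/14800  =====
This is the title of this item
Info: some note
Test finished. Result FooBar. Time 7 secunds.
Stats: CPU 2.5 MEM 2.8"""

stats = collect_stats(log.splitlines())
print(stats)
```

The same startswith() test can pick out the "Test finished." lines if only the result is needed.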

OK, so the regexp approach turned out to be more suitable in this case:

import re
re.findall("=\n(.*)\n", s)

is faster than the list comprehension

[item.split('\n', 1)[0] for item in s.split('=\n')]

Here's what I got:

>>> len(s)
337000000
>>> test(get1, s) #list comprehensions
0:00:04.923529
>>> test(get2, s) #re.findall()
0:00:02.737103

Lesson learned.

Maybe something like this (log.log is your file):

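I'd suggest starting a loop that looks for "===" in the line. Let that key you into the title, which is on the next line. Set a flag that looks for the result, and if you don't find a result before hitting the next "===", say there is no result. Otherwise, log the result with the title. Reset your flag and repeat. You could also store the results with the titles in a dictionary, and just store "No Results" if you don't find a result between the title and the next "===" line.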
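The loop described above could be sketched roughly as follows. The function name results_by_title and the short sample log are invented for the example, and the sketch assumes titles are unique enough to serve as dictionary keys, so a repeated "Update" block for the same item simply overwrites the earlier "No Results" entry.

```python
NO_RESULT = "No Results"


def results_by_title(lines):
    """Map each title to its result, or to NO_RESULT if none was found."""
    results = {}
    title = None
    next_is_title = False          # flag: the line after a header is the title
    for raw in lines:
        line = raw.strip()
        if line.startswith("==="):
            next_is_title = True
        elif next_is_title:
            title = line
            next_is_title = False
            # Keep an earlier result if this title was already seen.
            results.setdefault(title, NO_RESULT)
        elif title is not None and line.startswith("Test finished."):
            # "Test finished. Result Foo. Time 12 secunds." -> "Foo"
            results[title] = line.split("Result ", 1)[1].split(".", 1)[0]
    return results


# Made-up sample: the second item never reports a result.
sample = """===== Item 1/2 =====
First title
Info: some note
===== Item 1/2 (Update 1/1) =====
First title
Info: some other note
Test finished. Result Foo. Time 12 secunds.
Stats: CPU 0.5 MEM 5.3
===== Item 2/2 =====
Second title
Info: some note"""

results = results_by_title(sample.splitlines())
print(results)
```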

Based on the output, this looks pretty simple.

You could try something like this (in C-like pseudocode, since I don't know Python):

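Here's some not-so-nice-looking Perl code. Maybe you'll find it useful in some way. It's a quick hack and there are other ways to do it (I feel this code needs defending).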

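#!/usr/bin/perl -w
#
# $Id$
#

use strict;
use warnings;

my @ITEMS;
my $item;
my $state = 0;

open(FD, "< data.txt") or die "Failed to open file.";
while (my $line = <FD>) {
    $line =~ s/(\r|\n)//g;
    if ($line =~ /^===== Item (\d+)\/\d+/) {
        my $item_number = $1;
        if ($item) {
            # Just to make sure we don't have two lines in a row that seem to be a headline.
            # If we have an item but haven't set the title, two headlines in a row matched.
            die "Something seems to be wrong, better safe than sorry. Line $. : $line\n" if (not $item->{title});
            # If we have a new item number, add the previous item and create a new one.
            if ($item_number != $item->{item_number}) {
                push(@ITEMS, $item);
                $item = {};
                $item->{item_number} = $item_number;
            }
        } else {
            # First entry, we don't have an item yet.
            $item = {}; # Create new item.
            $item->{item_number} = $item_number;
        }
        $state = 1;
    } elsif ($state == 1) {
        die "Data must start with a headline." if (not $item);
        # If we already have a title, make sure it matches.
        if ($item->{title}) {
            if ($item->{title} ne $line) {
                die "Title doesn't match for item " . $item->{item_number} . ", line $. : $line\n";
            }
        } else {
            $item->{title} = $line;
        }
        $state++;
    } elsif (($state == 2) && ($line =~ /^Info:/)) {
        # Just make sure that for state 2 we have a line that matches Info.
        $state++;
    } elsif (($state == 3) && ($line =~ /^Test finished\. Result ([^.]+)\. Time \d+ secunds{0,1}\.$/)) {
        $item->{status} = $1;
        $state++;
    } elsif (($state == 4) && ($line =~ /^Stats:/)) {
        $state++; # After Stats we must have a new item or we should fail.
    } else {
        die "Invalid data, line $.: $line\n";
    }
}
# Need to take care of the last item too.
push(@ITEMS, $item) if ($item);
close FD;

# Loop over our items and print the info we stored.
for $item (@ITEMS) {
    print $item->{item_number} . " (" . $item->{status} . ") " . $item->{title} . "\n";
}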

I know you didn't ask for real code, but this is too good an opportunity for a generator-based text muncher to pass up:

# data is a multiline string containing your log, but this
# function could be easily rewritten to accept a file handle.
def get_stats(data):

   title = ""
   grab_title = False

   for line in data.split('\n'):
      if line.startswith("====="):
         grab_title = True
      elif grab_title:
         grab_title = False
         title = line
      elif line.startswith("Test finished."):
         start = line.index("Result") + 7
         end   = line.index("Time")   - 2
         yield (title, line[start:end])


for d in get_stats(data):
   print(d)


# Returns:
# ('This is the item title', 'Foo')
# ('This is this items title', 'Bar')
# ('This is the title of this item', 'FooBar')

Hope this is straightforward enough. Do ask if you have questions about exactly how the above works.
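The comment at the top of that function says it could be rewritten to accept a file handle. A minimal sketch of that variant follows; the name get_stats_from_lines is made up here, and io.StringIO stands in for a real file opened with open(). It takes any iterable of lines, so the whole log never has to sit in memory as one string.

```python
import io


def get_stats_from_lines(lines):
    """Yield (title, result) pairs from any iterable of log lines."""
    title = ""
    grab_title = False
    for raw in lines:
        line = raw.rstrip("\r\n")      # file iteration keeps the line ending
        if line.startswith("====="):
            grab_title = True
        elif grab_title:
            grab_title = False
            title = line
        elif line.startswith("Test finished."):
            start = line.index("Result") + 7   # skip past "Result "
            end = line.index("Time") - 2       # drop the ". " before "Time"
            yield (title, line[start:end])


# io.StringIO behaves like an open text file for this purpose.
handle = io.StringIO("""===== Item 5484/14800  =====
This is this items title
Info: some note
Test finished. Result Bar. Time 4 secunds.
Stats: CPU 0.9 MEM 4.7
===== Item 5485/14800  =====
This is the title of this item
Info: some note
Test finished. Result FooBar. Time 7 secunds.
Stats: CPU 2.5 MEM 2.8
""")

pairs = list(get_stats_from_lines(handle))
print(pairs)
```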

A regular expression with group matching seems to do the job in Python:

import re

data = """===== Item 5483/14800  =====
This is the item title
Info: some note
===== Item 5483/14800 (Update 1/3) =====
This is the item title
Info: some other note
===== Item 5483/14800 (Update 2/3) =====
This is the item title
Info: some more notes
===== Item 5483/14800 (Update 3/3) =====
This is the item title
Info: some other note
Test finished. Result Foo. Time 12 secunds.
Stats: CPU 0.5 MEM 5.3
===== Item 5484/14800  =====
This is this items title
Info: some note
Test finished. Result Bar. Time 4 secunds.
Stats: CPU 0.9 MEM 4.7
===== Item 5485/14800  =====
This is the title of this item
Info: some note
Test finished. Result FooBar. Time 7 secunds.
Stats: CPU 2.5 MEM 2.8"""


# Header line, then the title (group 1), an Info line, and the result (group 2).
p = re.compile(r"^=====[^=]*=====\n(.*)$\nInfo: .*\n.*Result ([^.]*)\.",
               re.MULTILINE)
for m in p.finditer(data):
    print("title:", m.group(1), "result:", m.group(2))

If you need more information about regular expressions, check: .

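This is a continuation of maciejka's solution (see the comments there). If the data is in the file daniels.log, then we can go through it item by item with itertools.groupby and apply a multi-line regexp to each item. This should scale fine.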

import itertools, re

p = re.compile(r"Result ([^.]*)\.", re.MULTILINE)
with open('daniels.log') as log:
    # Consecutive lines are grouped by whether they are an item header.
    for sep, item in itertools.groupby(log,
                                       lambda x: x.startswith('===== Item ')):
        if not sep:
            title = next(item).strip()
            m = p.search(''.join(item))
            if m:
                print((title, m.group(1)))

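It would be very helpful to see the exact output you'd like. Something like [('Item 5483/14800', '12') ...]?

grep -A1 -E "^===|^Test" $LOGFILE | grep -B2 "Test finished" | grep -v -- -- | sed -E -e '$!N;s/\n//' -e "s/Test finished. ([^.]*)\..*/, \1/"
Using GNU grep 2.2, this gives:
This is the item title, Result Foo
This is this items title, Result Bar
This is the title of this item, Result FooB

+1 ... The OP asked it this way: "Any help would be appreciated. I'm doing the parser in Python, but I don't need actual code, just some pointers on how I could achieve this."

Good thing he doesn't want code; I don't know Python worth a damn :)

Nice use of MULTILINE. The only problem is that it doesn't scale very well (you need to hold the whole file in memory at once).

What if he used itertools.groupby to go through the items?

It's just a suggestion, not a complete solution. If it read into a buffer until it hit a line starting with '====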
