Linux: Using Sed to extract XML content from a log file and dump each result to a different file

linux bash sed

I have the following 10 GB log file, which needs to be analyzed directly on a Unix server:

2017-12-12 13:04:28,716 [ABC] [DEF] DEBUG some message1
2017-12-12 13:04:28,716 [ABC] [DEF] DEBUG some message2
2017-12-12 13:04:28,716 [ABC] [DEF] DEBUG some message3
2017-12-12 13:04:28,716 [ABC] [DEF] DEBUG some message4
2017-12-12 13:04:28,716 [ABC] [DEF] DEBUG some message5
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG some message6
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG <xml>
<id>1</id> 
<!-- id is not unique since the XML data provides all the
information of an object X defined by its id at a specific point in time -->
some XML content on more than 500 lines
</xml>
2017-12-12 13:04:30,330 [ABC] [DEF] DEBUG some message8
2017-12-12 13:04:30,333 [ABC] [DEF] DEBUG some message9
2017-12-12 13:04:30,334 [ABC] [DEF] INFO some message10
2017-12-12 13:04:30,334 [ABC] [DEF] INFO some message11
2017-12-12 13:04:31,431 [ABC] [DEF] INFO some message12
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG <xml>
<id>2</id>
some XML content on more than 500 lines 
</xml>
2017-12-12 13:04:31,432 [ABC] [DEF] DEBUG some message13
2017-12-12 13:04:31,476 [ABC] [DEF] INFO some message14
2017-12-12 13:04:31,476 [ABC] [DEF] DEBUG some message14
2017-12-12 13:04:31,490 [ABC] [DEF] DEBUG some message15
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG <xml>
<id>1</id>
some XML content on more than 500 lines 
</xml>
2017-12-12 13:04:31,491 [ABC] [DEF] DEBUG some message16
2017-12-12 13:04:31,491 [ABC] [DEF] DEBUG some message17
2017-12-12 13:04:31,496 [ABC] [DEF] DEBUG some message18
2017-12-12 13:04:31,996 [ABC] [DEF] INFO some message19

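I considered a solution in which I use a back-reference to the value of the <id> tag to name the file to dump to, but it doesn't work: the same tag value appears at several places in the log file, so each new extraction would overwrite the previous one.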

sed -r 's~(<xml>…<id>(.*)</id>…</xml>)~echo "\1" >> \2.out~e' file.in #just a prototype

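With awk, this would also be very simple if the XML content were on a single line. That is not the case, though, and I don't know which separator to define for RS so that the XML content is treated as a single record and dumped into a separate file.

With awk, what I think would work is:

First, identify the start tag in the log and set a test variable to yes. Store each line of the XML in a buffer variable, then dump it to a file $i.out as soon as I have collected the data and, of course, reset the test variable to no. It would be great if you had a better solution using awk, or a solution using sed in which I can access a variable holding the number of the pattern currently being processed and reuse it to generate the output file. Something like: current_pattern_position, used to generate file_$current_pattern_position.out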
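The flag-and-buffer idea described above can be sketched as follows, assuming a POSIX awk. The file name sample.log, the variable names inside, buf and n, and the shortened two-block sample are illustrative assumptions, not part of the original log.

```shell
# Hypothetical, shortened sample; run this in a scratch directory.
cat > sample.log <<'EOF'
2017-12-12 13:04:28,716 [ABC] [DEF] DEBUG some message1
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG <xml>
<id>1</id>
</xml>
2017-12-12 13:04:30,330 [ABC] [DEF] DEBUG some message8
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG <xml>
<id>1</id>
</xml>
EOF

awk '
    # Start tag: drop the log prefix, raise the flag and start a fresh buffer.
    /<xml>/ { sub(/.*<xml>/, "<xml>"); inside = 1; buf = "" }
    # Inside a block: append the current line to the buffer.
    inside { buf = buf $0 "\n" }
    # End tag: dump the buffer to the next numbered file and reset the flag.
    inside && /<\/xml>/ {
        out = (++n) ".out"
        printf "%s", buf > out
        close(out)
        inside = 0
    }
' sample.log
```

Both blocks carry id 1, yet they land in 1.out and 2.out, because the file name comes from the counter rather than from the id. The buffer is not strictly required; each line could also be written to the current file as soon as it is read.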

I have already received very interesting solutions using awk and perl. I would like a working sed solution for this case.

Perl one-liner

perl -ne 'if(s/.*(?=<xml>)//){$x++;open$fh,">file$x.xml"}if($fh){print$fh $_}if(/<\/xml>/){close$fh;undef$fh}' input.txt

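How it works

-n: like sed -n, reads the input or the argument files without printing each line automatically

s/.*(?=<xml>)//: removes everything to the left of <xml>, and evaluates to true if there was a match

GNU Awk solution: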

awk -v RS='<xml>|</xml>' '!(NR%2){ 
           gsub(/^[[:space:]]*|[[:space:]]*$/, ""); 
           printf "<xml>\n%s\n</xml>\n",$0 > "file"++c".xml";
           close("file"c".xml")
       }' file

Viewing the results:

$ head file*.xml
==> file1.xml <==
<xml>
<id>1</id> 
<!-- id is not unique since the xml data provides all the
information of an object X defined by its id at a specific point in time -->
some xml content on more than 500 lines
</xml>

==> file2.xml <==
<xml>
<id>2</id>
some xml content on more than 500 lines
</xml>

==> file3.xml <==
<xml>
<id>1</id>
some xml content on more than 500 lines
</xml>

If the log contains too many XML objects, you may get a "too many open files" error, so I added an optional close() of the output file.
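To illustrate the point about open files, here is a small sketch, assuming a POSIX shell and awk. ulimit -n prints the per-process limit on open file descriptors, and the loop writes a few files while closing each one right after writing, so only one output file is open at any time. The demo prefix and the count of 3 are arbitrary choices for the illustration.

```shell
# Print the maximum number of file descriptors a process may hold open.
ulimit -n

# Write three hypothetical files, closing each one before opening the next.
awk 'BEGIN {
    for (i = 1; i <= 3; i++) {
        out = "demo" i ".txt"
        print "block " i > out
        close(out)   # release the descriptor immediately
    }
}'
```

Without the close() call, awk would keep every output file open until it exits, which is what eventually runs into the limit reported by ulimit -n.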

Update: here is a portable, simplified approach using Sed:

#!/bin/sed -nf

# Execute the following group of commands for each line in the XML node to
# generate a series of shell commands that we'll feed into an interpreter:
/<xml>/,/<\/xml>/ {
    # Extract the ID number to generate a command that changes the output file:
    /^<id>\([0-9][0-9]*\)<\/id>$/ {
        # Using the same pattern as above, substitute the ID number into a
        # command that updates the current output file and increments a counter
        # for the ID that we'll append as the filename extension:
        s//c\1=$(( c\1 + 1 )); exec > "file\1.$c\1"/
        # Output the generated command:
        p
        # Then, proceed to the next line:
        n
    }
    # Output any remaining lines in the XML block except for the <xml> tags:
    /<\/*xml>/ !{
        # Escape any single quotes in the XML content (so we can wrap it in a
        # shell command below):
        s/'/'"'"'/g
        #'# (...ignore or remove this line...)
        # Generate a command that will write the line to the current file:
        s/^.*$/echo '&'/
        # Output the generated command:
        p
    }
}

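It works, but we can see that Sed is not the best tool for this problem. Sed's simple language was not designed for this kind of logic, so the code is not pretty, and we rely on the shell to create the files, which adds a bit of overhead. If you are set on using Sed, the job may simply take a little longer. For performance-critical problems, consider using one of the tools described in the other answers.

Based on the information and the example in the question, I assumed that we don't want the opening and closing <xml> tags in the output, and that the ID is always a number on a line of its own. The implementation writes to file names with a numeric extension that is incremented whenever a duplicate ID is found (fileID.count: file1.1, file1.2, and so on). It should be easy to change these details if needed.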
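To make the mechanism more concrete, the sketch below feeds sh the kind of commands the Sed script generates for two XML blocks that share id 1. The <name> lines are made up for illustration; only the shape of the counter/exec line and of the quoted echo lines follows the script.

```shell
# Hand-written stand-in for the Sed script's output, passed to a shell
# through a quoted here-document so that nothing is expanded beforehand.
sh <<'EOF'
c1=$(( c1 + 1 )); exec > "file1.$c1"
echo '<name>it'"'"'s the first block</name>'
c1=$(( c1 + 1 )); exec > "file1.$c1"
echo '<name>second block</name>'
EOF
```

The first exec redirects the shell's standard output to file1.1 and the second one to file1.2, so a repeated id never overwrites an earlier extraction. The '"'"' sequence is how a single quote inside the XML content survives the single-quoted echo argument.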

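Note: if you need them, the revision history contains a version that uses GNU Sed and another that uses a wrapper script; I removed both for brevity. They work, but they are either too slow or too complex.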

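Thanks for your answer, but this is exactly what I proposed in my post: extract each XML message and dump it into a separate file.

There is no buffering here. But yes, this is generally the right approach.

@Allan Why use an in-memory buffer when it isn't needed?

Thanks, I added the modification.

Using sub/gsub as the condition is nice. By the way, do you think this could be done with sed?

This accumulates open file descriptors, which may cause problems after around 1020 unique ids (see ulimit -n). A close("file"c".xml") will solve this.

Thanks for your answer. By the way, do you think this could be done with sed?

sed is not designed to write to multiple files. It can apparently be done with GNU sed using echo + >>, but that is probably not ideal because the file is opened for every line, and the same goes for awk. I think perl is more suitable.

As stated in the question, any other sexy solution using any Unix command or tool is of course welcome.

I'm not sure what you mean by "the same goes for awk": awk is no less efficient than perl at handling output files, and it certainly does not open an output file for every line.

Actually, looking at the system calls shows that with the awk solution all the output files stay open, whereas in perl they can be closed explicitly.

Please clarify this so that nobody wastes time posting something you don't want: do you really want a solution that uses a single sed script, like the equivalent awk and perl solutions you have so far, or do you want a bash solution that may use multiple calls to sed and other tools? If sed only, should it be portable, or may it be specific to one sed variant, e.g. GNU sed?

Ideally a single sed script; if that is not possible, multiple calls are fine. GNU sed is fine :) Thanks for your help! Let me know if you need more information.

Unless you want to do some hacking with GNU sed's e command, which ultimately boils down to stuffing a shell script into sed, I don't think you can write to multiple different files with a single sed invocation.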

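This answer has it nailed, except that the OP noted that the <id>s are not necessarily unique. In that case, a second non-unique <id> would append what should be a second file's content to the first file with that number.

@agc You're right! I misread the question and thought the OP wanted the output grouped by ID. I'll look at this again tomorrow. Thanks.

I updated the answer to address the duplicate IDs and the shell quoting issue. /cc @agc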

awk 'sub(/.*<xml>/,"<xml>") {out="file" ++i ".xml"; p=1}
     p {print > out}
     /<\/xml>/ {p=0; close(out)}
' file

$ sed -nf parse_log.sed < file.in | sh

sed -n '/<xml>/,/<\/xml>/ {
    /^<id>\([0-9][0-9]*\)<\/id>$/{s//c\1=$(( c\1 + 1 ));exec > "file\1.$c\1"/;p;n;}
    /<\/*xml>/!{'"s/'/'\"'\"'/g;"'s/^.*$/echo '"'&'"'/;p;}
}' < file.in | sh