分割和处理XML文件_Xml_Perl - Fatal编程技术网

分割和处理XML文件

xml perl

分割和处理XML文件,xml,perl,Xml,Perl,我有一个original.xml文件，其结构如下： <root> <parent attr1="val1" attr2="val2" ... > <child key1="val3" key2="val4" ... /> <child key1="val5" key2="val6" ... /> ... </parent> ... <parent at

我有一个

original.xml

文件，其结构如下：

<root>
    <parent attr1="val1" attr2="val2" ... >
        <child key1="val3" key2="val4" ... />
        <child key1="val5" key2="val6" ... />
        ...
    </parent>
    ...
    <parent attr1="val7" attr2="val8" ... />
    ...
</root>

为此，我使用以下perl单行命令：

perl -p -i -e 'open(F, ">", "new/".($1).".xml") if /<parent attr1="(.*)" attr2="(.*)" ... /; print {F} "<newroot><newparent attr1=\"".($1)."\" attr2=\"".($2)."\" /></newroot>";' "original.xml"

我不太明白为什么会这样。如何使perl命令只输出一个包含所需数据的

newroot

元素？

我建议使用XSLT来解决这个问题，例如，LibXSLT支持

exsl:document

（请参阅），这样您就可以编写XSLT样式表来实现这一点

<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:exsl="http://exslt.org/common"
  extension-element-prefixes="exsl"
  exclude-result-prefixes="exsl"
  version="1.0">

<xsl:template match="/">
  <xsl:apply-templates select="root/parent[@attr1]" mode="new"/>
</xsl:template>

<xsl:template match="parent" mode="new">
  <xsl:message>Writing file <xsl:value-of select="@attr1"/></xsl:message>
  <exsl:document href="{@attr1}.xml" method="xml" indent="yes">
    <newroot>
      <xsl:apply-templates select="."/>
    </newroot>
  </exsl:document>
</xsl:template>

<xsl:template match="parent">
  <newparent>
    <xsl:copy-of select="@*"/>
  </newparent>
</xsl:template>

</xsl:stylesheet>

正如您在评论中指出的，您遇到了内存和/或性能问题，这里有一种替代方法，使用

LibXML:：Reader

，它是一种前向读取拉解析器，不会将完整的XML加载到内存中的树结构中：

  use strict;
  use warnings;
  use XML::LibXML::Reader;
  use XML::LibXML;

  my $reader = XML::LibXML::Reader->new(location => "input.xml")
         or die "cannot read file.xml\n";

  while ($reader->read) {
    processNode($reader);
  }

  sub processNode {
      my $reader = shift;

      if ($reader->nodeType == XML_READER_TYPE_ELEMENT && $reader->name eq "parent")
      {
        my $clone = $reader->copyCurrentNode(0);
        $clone->setName('newparent');

        my $doc = XML::LibXML::Document->new( );
        $doc->setDocumentElement($doc->createElement('newroot'));
        $doc->documentElement()->appendChild($clone);

        my $filePrefix = $clone->getAttribute('attr1');
        my $fileName = "$filePrefix-result.xml";
        print "Writing file $fileName.\n";

        $doc->toFile($fileName, 1);

      }
  }

我希望它可以避免内存问题。

不要使用正则表达式。使用解析器。我很喜欢。（

XML:：LibXML

也很不错）

给定您的示例数据，这大致满足您的要求

好吧，那么你没有提到的事情是：

马丁非常感谢。这很有效。但有一件事我忘了提。My original.xml有超过一百万个父标记

这一点非常重要——XML是一个标记匹配过程，这意味着它不能确保它已经完成，并且在它到达末尾之前XML是有效的。这通常意味着解析整个文档以验证标记是否匹配

XML的缺点之一是它的内存占用通常是文件大小的10倍左右

然而，

XML:：Twig

还有另一个有用的特性--

Twig\u处理程序

和

清除

#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

sub extract_parent_elt {
    my ( $twig, $parent_elt ) = @_;

    #remove children if that's what you want?
    $_->delete for $parent_elt->children();

    #pick out the attr for our file name
    my $newname = $parent_elt->att('attr1');
    print "Opening:", $newname, ".xml\n";

    #create a new document - insert a 'newroot' as it's root element. 
    my $new_doc = XML::Twig->new->set_root( XML::Twig::Elt->new('newroot') );
    #cut and paste this element into the new doc. 
    $parent_elt->cut;
    $parent_elt->paste( $new_doc->root );
    #note - because we're purging and not saving the 'old' doc, a
    #cut doesn't modify the original. 

    #open output:
    open( my $output, '>', "$newname.xml" ) or die $!;
    #note - pretty print has some limitations. 
    #specifically - there are some XML things that it breaks. 
    #your code doesn't _appear_ to have these. 
    $new_doc -> set_pretty_print('indented_a'); 
    print {$output} $new_doc->sprint;
    close($output);

    #discard everything so far.
    $twig->purge;
}

my $twig = XML::Twig->new( 
                twig_handlers => { 'parent' => \&extract_parent_elt } );

$twig -> parsefile('original.xml');

处理程序在找到匹配的“close”元素时启动，并获得该XML块

purge

告诉twig放弃迄今为止已处理的任何内容（例如，任何具有“已关闭”标记的内容）

您考虑过使用XSLT吗？我认为libxslt可以从Perl中使用，并且它支持exsl:document，这样就可以正确处理XML了。我肯定要花几个星期的时间来牢牢掌握它，但是对于这个特殊的问题，我感觉切换到XSLT需要花费太长的时间，因为它几乎完成了，只有这一个bug破坏了它；打印“…”让我觉得很奇怪：如果if块中没有任何内容，if将实现什么？@bytepusher:这是作为语句修饰符的

if

。它的

打开（…）if/../@BytePasher我不确定我是否理解。我需要if
块，以便重用属性值。Martin非常感谢。这很有效。但有一件事我忘了提。我的original.xml
有超过一百万个parent标记。当我运行你的代码时，它确实按预期工作，但它完全阻塞了我的计算机。完成这一切可能需要几个小时，甚至几天。另一方面，我以前的脚本速度更快，效率更高。这就是为什么我决定首先使用它。有什么方法可以坚持使用perl吗？@我添加了另一种方法，在读取大文件时性能会更好。Sobrique非常感谢。我试过你的代码，它也运行得很好，但与Martin的代码有相同的缺点。请看一看。XML的大小在问题中很重要。但是是的，这是可以通过细枝处理器实现的。下一步靠近键盘时，我将编辑一个示例
  use XML::LibXSLT;
  use XML::LibXML;

  my $xslt = XML::LibXSLT->new();

  my $source = XML::LibXML->load_xml(location => 'original.xml');
  my $style_doc = XML::LibXML->load_xml(location => 'sheet1.xsl');

  my $stylesheet = $xslt->parse_stylesheet($style_doc);

  my $results = $stylesheet->transform($source);

  print $stylesheet->output_as_bytes($results);

  use strict;
  use warnings;
  use XML::LibXML::Reader;
  use XML::LibXML;

  my $reader = XML::LibXML::Reader->new(location => "input.xml")
         or die "cannot read file.xml\n";

  while ($reader->read) {
    processNode($reader);
  }

  sub processNode {
      my $reader = shift;

      if ($reader->nodeType == XML_READER_TYPE_ELEMENT && $reader->name eq "parent")
      {
        my $clone = $reader->copyCurrentNode(0);
        $clone->setName('newparent');

        my $doc = XML::LibXML::Document->new( );
        $doc->setDocumentElement($doc->createElement('newroot'));
        $doc->documentElement()->appendChild($clone);

        my $filePrefix = $clone->getAttribute('attr1');
        my $fileName = "$filePrefix-result.xml";
        print "Writing file $fileName.\n";

        $doc->toFile($fileName, 1);

      }
  }

#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new->parsefile('your_source.xml');

foreach my $parent_elt ( $twig->findnodes('//parent') ) {
    #remove children if that's what you want?
    $_->delete for $parent_elt->children();
    my $newname = $parent_elt->att('attr1');
    print "Opening:", $newname, "\n";
    my $new_doc = XML::Twig->new->set_root( XML::Twig::Elt->new('newroot') );
    $parent_elt->cut;
    $parent_elt->paste( $new_doc->root );

    $new_doc -> set_pretty_print ('indented_a'); 
    open( my $output, '>', "$newname.xml" ) or die $!;
    print {$output} $new_doc->sprint;
    close($output);
}

#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

sub extract_parent_elt {
    my ( $twig, $parent_elt ) = @_;

    #remove children if that's what you want?
    $_->delete for $parent_elt->children();

    #pick out the attr for our file name
    my $newname = $parent_elt->att('attr1');
    print "Opening:", $newname, ".xml\n";

    #create a new document - insert a 'newroot' as it's root element. 
    my $new_doc = XML::Twig->new->set_root( XML::Twig::Elt->new('newroot') );
    #cut and paste this element into the new doc. 
    $parent_elt->cut;
    $parent_elt->paste( $new_doc->root );
    #note - because we're purging and not saving the 'old' doc, a
    #cut doesn't modify the original. 

    #open output:
    open( my $output, '>', "$newname.xml" ) or die $!;
    #note - pretty print has some limitations. 
    #specifically - there are some XML things that it breaks. 
    #your code doesn't _appear_ to have these. 
    $new_doc -> set_pretty_print('indented_a'); 
    print {$output} $new_doc->sprint;
    close($output);

    #discard everything so far.
    $twig->purge;
}

my $twig = XML::Twig->new( 
                twig_handlers => { 'parent' => \&extract_parent_elt } );

$twig -> parsefile('original.xml');