Warning: file_get_contents(/data/phpspider/zhask/data//catemap/0/xml/14.json): failed to open stream: No such file or directory in /data/phpspider/zhask/libs/function.php on line 167

Warning: Invalid argument supplied for foreach() in /data/phpspider/zhask/libs/tag.function.php on line 1116

Notice: Undefined index: in /data/phpspider/zhask/libs/function.php on line 180

Warning: array_chunk() expects parameter 1 to be array, null given in /data/phpspider/zhask/libs/function.php on line 181

Warning: file_get_contents(/data/phpspider/zhask/data//catemap/8/perl/10.json): failed to open stream: No such file or directory in /data/phpspider/zhask/libs/function.php on line 167

Warning: Invalid argument supplied for foreach() in /data/phpspider/zhask/libs/tag.function.php on line 1116

Notice: Undefined index: in /data/phpspider/zhask/libs/function.php on line 180

Warning: array_chunk() expects parameter 1 to be array, null given in /data/phpspider/zhask/libs/function.php on line 181

Warning: file_get_contents(/data/phpspider/zhask/data//catemap/8/selenium/4.json): failed to open stream: No such file or directory in /data/phpspider/zhask/libs/function.php on line 167

Warning: Invalid argument supplied for foreach() in /data/phpspider/zhask/libs/tag.function.php on line 1116

Notice: Undefined index: in /data/phpspider/zhask/libs/function.php on line 180

Warning: array_chunk() expects parameter 1 to be array, null given in /data/phpspider/zhask/libs/function.php on line 181
分割和处理XML文件_Xml_Perl - Fatal编程技术网

分割和处理XML文件

分割和处理XML文件,xml,perl,Xml,Perl,我有一个original.xml文件,其结构如下: <root> <parent attr1="val1" attr2="val2" ... > <child key1="val3" key2="val4" ... /> <child key1="val5" key2="val6" ... /> ... </parent> ... <parent at

我有一个
original.xml
文件,其结构如下:

<root>
    <parent attr1="val1" attr2="val2" ... >
        <child key1="val3" key2="val4" ... />
        <child key1="val5" key2="val6" ... />
        ...
    </parent>
    ...
    <parent attr1="val7" attr2="val8" ... />
    ...
</root>
为此,我使用以下perl单行命令:

perl -p -i -e 'open(F, ">", "new/".($1).".xml") if /<parent attr1="(.*)" attr2="(.*)" ... /; print {F} "<newroot><newparent attr1=\"".($1)."\" attr2=\"".($2)."\" /></newroot>";' "original.xml"

我不太明白为什么会这样。如何使perl命令只输出一个包含所需数据的
newroot
元素?

我建议使用XSLT来解决这个问题,例如,LibXSLT支持
exsl:document
(请参阅),这样您就可以编写XSLT样式表来实现这一点

<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:exsl="http://exslt.org/common"
  extension-element-prefixes="exsl"
  exclude-result-prefixes="exsl"
  version="1.0">

<xsl:template match="/">
  <xsl:apply-templates select="root/parent[@attr1]" mode="new"/>
</xsl:template>

<xsl:template match="parent" mode="new">
  <xsl:message>Writing file <xsl:value-of select="@attr1"/></xsl:message>
  <exsl:document href="{@attr1}.xml" method="xml" indent="yes">
    <newroot>
      <xsl:apply-templates select="."/>
    </newroot>
  </exsl:document>
</xsl:template>

<xsl:template match="parent">
  <newparent>
    <xsl:copy-of select="@*"/>
  </newparent>
</xsl:template>

</xsl:stylesheet>
正如您在评论中指出的,您遇到了内存和/或性能问题,这里有一种替代方法,使用
LibXML::Reader
,它是一种前向读取拉解析器,不会将完整的XML加载到内存中的树结构中:

  use strict;
  use warnings;
  use XML::LibXML::Reader;
  use XML::LibXML;

  my $reader = XML::LibXML::Reader->new(location => "input.xml")
         or die "cannot read file.xml\n";

  while ($reader->read) {
    processNode($reader);
  }

  sub processNode {
      my $reader = shift;

      if ($reader->nodeType == XML_READER_TYPE_ELEMENT && $reader->name eq "parent")
      {
        my $clone = $reader->copyCurrentNode(0);
        $clone->setName('newparent');

        my $doc = XML::LibXML::Document->new( );
        $doc->setDocumentElement($doc->createElement('newroot'));
        $doc->documentElement()->appendChild($clone);

        my $filePrefix = $clone->getAttribute('attr1');
        my $fileName = "$filePrefix-result.xml";
        print "Writing file $fileName.\n";

        $doc->toFile($fileName, 1);

      }
  }

我希望它可以避免内存问题。

不要使用正则表达式。使用解析器。我很喜欢。(
XML::LibXML
也很不错)

给定您的示例数据,这大致满足您的要求

好吧,那么你没有提到的事情是:


马丁非常感谢。这很有效。但有一件事我忘了提。My original.xml有超过一百万个父标记

这一点非常重要——XML是一个标记匹配过程,这意味着它不能确保它已经完成,并且在它到达末尾之前XML是有效的。这通常意味着解析整个文档以验证标记是否匹配

XML的缺点之一是它的内存占用通常是文件大小的10倍左右

然而,
XML::Twig
还有另一个有用的特性--
Twig\u处理程序
清除

#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

sub extract_parent_elt {
    my ( $twig, $parent_elt ) = @_;

    #remove children if that's what you want?
    $_->delete for $parent_elt->children();

    #pick out the attr for our file name
    my $newname = $parent_elt->att('attr1');
    print "Opening:", $newname, ".xml\n";

    #create a new document - insert a 'newroot' as it's root element. 
    my $new_doc = XML::Twig->new->set_root( XML::Twig::Elt->new('newroot') );
    #cut and paste this element into the new doc. 
    $parent_elt->cut;
    $parent_elt->paste( $new_doc->root );
    #note - because we're purging and not saving the 'old' doc, a
    #cut doesn't modify the original. 

    #open output:
    open( my $output, '>', "$newname.xml" ) or die $!;
    #note - pretty print has some limitations. 
    #specifically - there are some XML things that it breaks. 
    #your code doesn't _appear_ to have these. 
    $new_doc -> set_pretty_print('indented_a'); 
    print {$output} $new_doc->sprint;
    close($output);

    #discard everything so far.
    $twig->purge;
}

my $twig = XML::Twig->new( 
                twig_handlers => { 'parent' => \&extract_parent_elt } );

$twig -> parsefile('original.xml');

处理程序在找到匹配的“close”元素时启动,并获得该XML块
purge
告诉twig放弃迄今为止已处理的任何内容(例如,任何具有“已关闭”标记的内容)

您考虑过使用XSLT吗?我认为libxslt可以从Perl中使用,并且它支持exsl:document,这样就可以正确处理XML了。我肯定要花几个星期的时间来牢牢掌握它,但是对于这个特殊的问题,我感觉切换到XSLT需要花费太长的时间,因为它几乎完成了,只有这一个bug破坏了它;打印“…”让我觉得很奇怪:如果if块中没有任何内容,if将实现什么?@bytepusher:这是作为语句修饰符的
if
。它的
打开(…)if/../@BytePasher我不确定我是否理解。我需要
if
块,以便重用属性值。Martin非常感谢。这很有效。但有一件事我忘了提。我的
original.xml
有超过一百万个
parent
标记。当我运行你的代码时,它确实按预期工作,但它完全阻塞了我的计算机。完成这一切可能需要几个小时,甚至几天。另一方面,我以前的脚本速度更快,效率更高。这就是为什么我决定首先使用它。有什么方法可以坚持使用perl吗?@我添加了另一种方法,在读取大文件时性能会更好。Sobrique非常感谢。我试过你的代码,它也运行得很好,但与Martin的代码有相同的缺点。请看一看。XML的大小在问题中很重要。但是是的,这是可以通过细枝处理器实现的。下一步靠近键盘时,我将编辑一个示例
  use XML::LibXSLT;
  use XML::LibXML;

  my $xslt = XML::LibXSLT->new();

  my $source = XML::LibXML->load_xml(location => 'original.xml');
  my $style_doc = XML::LibXML->load_xml(location => 'sheet1.xsl');

  my $stylesheet = $xslt->parse_stylesheet($style_doc);

  my $results = $stylesheet->transform($source);

  print $stylesheet->output_as_bytes($results);
  use strict;
  use warnings;
  use XML::LibXML::Reader;
  use XML::LibXML;

  my $reader = XML::LibXML::Reader->new(location => "input.xml")
         or die "cannot read file.xml\n";

  while ($reader->read) {
    processNode($reader);
  }

  sub processNode {
      my $reader = shift;

      if ($reader->nodeType == XML_READER_TYPE_ELEMENT && $reader->name eq "parent")
      {
        my $clone = $reader->copyCurrentNode(0);
        $clone->setName('newparent');

        my $doc = XML::LibXML::Document->new( );
        $doc->setDocumentElement($doc->createElement('newroot'));
        $doc->documentElement()->appendChild($clone);

        my $filePrefix = $clone->getAttribute('attr1');
        my $fileName = "$filePrefix-result.xml";
        print "Writing file $fileName.\n";

        $doc->toFile($fileName, 1);

      }
  }
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new->parsefile('your_source.xml');

foreach my $parent_elt ( $twig->findnodes('//parent') ) {
    #remove children if that's what you want?
    $_->delete for $parent_elt->children();
    my $newname = $parent_elt->att('attr1');
    print "Opening:", $newname, "\n";
    my $new_doc = XML::Twig->new->set_root( XML::Twig::Elt->new('newroot') );
    $parent_elt->cut;
    $parent_elt->paste( $new_doc->root );

    $new_doc -> set_pretty_print ('indented_a'); 
    open( my $output, '>', "$newname.xml" ) or die $!;
    print {$output} $new_doc->sprint;
    close($output);
}
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

sub extract_parent_elt {
    my ( $twig, $parent_elt ) = @_;

    #remove children if that's what you want?
    $_->delete for $parent_elt->children();

    #pick out the attr for our file name
    my $newname = $parent_elt->att('attr1');
    print "Opening:", $newname, ".xml\n";

    #create a new document - insert a 'newroot' as it's root element. 
    my $new_doc = XML::Twig->new->set_root( XML::Twig::Elt->new('newroot') );
    #cut and paste this element into the new doc. 
    $parent_elt->cut;
    $parent_elt->paste( $new_doc->root );
    #note - because we're purging and not saving the 'old' doc, a
    #cut doesn't modify the original. 

    #open output:
    open( my $output, '>', "$newname.xml" ) or die $!;
    #note - pretty print has some limitations. 
    #specifically - there are some XML things that it breaks. 
    #your code doesn't _appear_ to have these. 
    $new_doc -> set_pretty_print('indented_a'); 
    print {$output} $new_doc->sprint;
    close($output);

    #discard everything so far.
    $twig->purge;
}

my $twig = XML::Twig->new( 
                twig_handlers => { 'parent' => \&extract_parent_elt } );

$twig -> parsefile('original.xml');