How can I modify complex XML documents in Perl to add extra markup to text nodes?

I have an XML document like this:

<article>
  <author>Smith</author>
  <date>2011-10-10</date>
  <description>Article about <b>frobnitz</b>, crulps and furtikurty's. Mainly frobnitz</description>
</article>

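I am using XML::Twig and have set a twig handler for the description tag. However, when I call $node->text, I get the text with the intervening markup stripped out. What I actually want to do is traverse the (very small) tree so that the existing markup is preserved rather than destroyed. The final XML output should look like this: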
<article>
  <author>Smith</author>
  <date>2011-10-10</date>
  <description>Article about <b><a href="dictionary.html#frobnitz">frobnitz</a></b>, <a href="dictionary.html#crulps">crulps</a> and <a href="dictionary.html#furtikurty">furtikurty</a>'s. Mainly frobnitz</description>
</article>

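I also have an alternative available in the target environment, but I am not sure how to get started with it.

Here is my minimal test case so far. Thanks for your help.
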
#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

my %dictionary = (
    frobnitz    => 'dictionary.html#frobnitz',
    crulps      => 'dictionary.html#crulps',
    furtykurty  => 'dictionary.html#furtykurty',
    );

sub markup_plain_text { 
    my ( $text ) = @_;

    foreach my $k ( keys %dictionary ) {
        $text =~ s/(^|\W)($k)(\W|$)}/$1<a href="$dictionary{$k}">$2<\/a>$3/si;
    }

    return $text;
}

sub convert {
    my( $t, $node ) = @_;
    warn "convert: TEXT=[" . $node->text . "]\n";
    $node->set_text( markup_plain_text($node->text) );
    return 1;
}

sub markup {
    my ( $text ) = @_;

    my $t = XML::Twig->new(
        twig_handlers => { description => \&convert },
        pretty_print  => 'indented',
        );
    $t->parse( $text );

    return $t->flush;
}


my $orig = <<END_XML;
<article>
    <author>Smith</author>
    <date>2011-10-10</date>
    <description>Article about <b>frobnitz</b>, crulps and furtikurty's. Mainly frobnitz's</description>
</article>
END_XML
;

markup($orig);

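This is a bit tricky, but XML::Twig was designed for this kind of processing (which I do a lot). There is a specific method for it, called mark, which takes a regexp and tags the matches.

In this case the regexp could get quite big. I used Regexp::Assemble to build it, so it ends up optimized. The other problem is that mark does not let you use the matched text to set attributes (I may take care of that in the next version of the module, it would be useful), so I have to mark first, then go back in a second pass and set the href attribute. A second pass is needed in any case, to "unlink" words that have already been linked.

One last remark: I nearly gave up writing a solution because your example data has some typos. There is nothing worse than writing the code properly, only to see the test still fail because the code uses "dictionary" and the data uses "definitions", or because of "furtykurtle", "furtikurty" and "furtikurty", which should all be the same word. So please make sure your data is correct before posting. Thankfully, I write my code test-first.
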
#!/usr/bin/perl 

use strict;
use warnings;

use XML::Twig;
use Regexp::Assemble;

use Test::More tests => 1; 
use autodie qw(open);

my %dictionary = (
    frobnitz    => 'definitions.html#frobnitz',
    crulps      => 'definitions.html#crulps',
    furtikurty  => 'definitions.html#furtikurty',
    );

my $match_defs= Regexp::Assemble->new()
                                ->add( keys %dictionary)
                                ->anchor_word
                                ->as_string;
# I am not familiar enough with Regexp::Assemble to know a cleaner
# way to get the capturing parentheses into the regexp
$match_defs= qr/($match_defs)/; 

my $in       = data_para(); 
my $expected = data_para();
my $out;
open( my $out_fh, '>', \$out);


XML::Twig->new( twig_roots => { 'description' => sub { tag_defs( @_, $out_fh, $match_defs, \%dictionary); } },
                twig_print_outside_roots => $out_fh, 
              )
         ->parse( $in);

is( $out, $expected, 'base test');
exit;

sub tag_defs
  { my( $t, $description, $out_fh, $match_defs, $dictionary)= @_;

    my @a= $description->mark( $match_defs, 'a' );

    # word => 1 when already used in this description
    # this might need to have a different scope if you need to tag
    # only the first time the word appears in a section or whatever
    my $tagged_in_description; 

    foreach my $a (@a) 
      { my $word= $a->text;
        warn "checking a: ", $a->sprint, "\n";

        if( $tagged_in_description->{$word})
          { $a->erase; } # we did not need to tag it after all
        else
          { $a->set_att( href => $dictionary->{$word}); }
        $tagged_in_description->{$word}++;
      }

    $t->flush( $out_fh); }


# helper that looks up the link for a word (not called by tag_defs)
sub def_href
  { my( $word)= @_;
    return $dictionary{$word};
  }

sub data_para
  { local $/="\n\n";
    my $para= <DATA>;
    return $para;
  }

__DATA__
<article>
  <author>Smith</author>
  <date>2011-10-10</date>
  <description>Article about <b>frobnitz</b>, crulps and furtikurty's. Mainly frobnitz</description>
</article>

<article>
  <author>Smith</author>
  <date>2011-10-10</date>
  <description>Article about <b><a href="definitions.html#frobnitz">frobnitz</a></b>, <a href="definitions.html#crulps">crulps</a> and <a href="definitions.html#furtikurty">furtikurty</a>'s. Mainly frobnitz</description>
</article>
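
For readers who want to see the two passes in isolation, here is a minimal sketch on a single in-memory element, without twig_roots or output filehandles. It assumes a plain alternation regexp instead of Regexp::Assemble, and the names %href_for, $match_defs and $result are made up for this illustration; only mark, text, erase, set_att and sprint come from XML::Twig.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical word => link table, mirroring %dictionary in the answer.
my %href_for = (
    frobnitz => 'definitions.html#frobnitz',
    crulps   => 'definitions.html#crulps',
);

# Plain alternation with one capturing group; mark() wraps what it captures.
my $words      = join '|', map { quotemeta } sort keys %href_for;
my $match_defs = qr/(\b(?:$words)\b)/;

my $twig = XML::Twig->new;
$twig->parse('<description>About <b>frobnitz</b> and crulps. Mainly frobnitz</description>');
my $description = $twig->root;

# Pass 1: wrap every match in a bare <a>; the existing <b> stays in place.
my @links = $description->mark( $match_defs, 'a' );

# Pass 2: link only the first occurrence of each word, unwrap the others.
my %seen;
for my $link (@links) {
    my $word = $link->text;
    if ( $seen{$word}++ ) {
        $link->erase;    # drops the <a> but leaves its text in the parent
    }
    else {
        $link->set_att( href => $href_for{$word} );
    }
}

my $result = $description->sprint;
print "$result\n";
```

Because mark works on the text descendants of the element, the word inside the b element is wrapped where it sits, which is what keeps the original markup intact.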

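Comments:

I have no web programming experience, but isn't it a bit odd to put HTML inside XML?

Yes, it is. This particular CMS has a DTD that allows a subset of HTML. In production the XML is transformed a second time and the HTML is wrapped in CDATA sections, which is a bit more normal.

It is fantastic to have the question answered by the author of the actual module. I feel like I asked a golf question and Tiger Woods answered. Thanks! Sorry about the test data: my own efforts had gone nowhere (I was trying to use insert_new_elt), so I had not noticed the typos yet.

Quick question: I had to add a "chomp $in;" to get the test to actually pass, because the $in version has an extra trailing newline. Is that a Perl version difference? (I am using 5.10.1.)

Thanks. I try to answer questions about XML::Twig; good customer service is key to a module's success ;--) The extra newline is there because I did not copy it properly (so you know you are not the only one who makes typos!).

In the current version (0.35) of Regexp::Assemble you can use ->re to get the compiled regexp, instead of calling ->as_string and then applying qr// by hand.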
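
As a rough sketch of that last suggestion, the regexp could be built as below. The word list and variable names are made up for this illustration. I am assuming the capturing group that mark relies on still has to be added around the result, since ->re only returns the assembled pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Assemble;

# Hypothetical word list; in the answer these are the keys of %dictionary.
my @words = qw(frobnitz crulps furtikurty);

# ->re returns the assembled pattern as a compiled regexp (a qr// object).
my $assembled = Regexp::Assemble->new
                                ->add(@words)
                                ->anchor_word
                                ->re;

# Add the capturing group here, so that $1 holds the matched word.
my $match_defs = qr/($assembled)/;

for my $text ( "crulps and furtikurty's", "crulpsy" ) {
    if ( $text =~ $match_defs ) {
        print "'$text' matches: $1\n";
    }
    else {
        print "'$text' does not match\n";
    }
}
```

The anchor_word call puts word boundaries around the pattern, so a dictionary word is not matched inside a longer word.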