Regex: conditional replacement in Perl based on the content of the nearest specific HTML tag

regex perl

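I have an HTML fragment containing two dictionary entries, for "cat" and "choose":

Update: each entry may contain several h2 and span elements, and they always come in pairs. Every HTML tag in the final output should keep its original order.

This is what I have tried so far: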

if ($.<2 or $n){
    $line = $_;
    $line =~ s,\n,,g;
    chomp $line;
    $hw = $line;
    $to_insert = "";
    $before = 0;
}else{
    $to_insert = "";
    $before = 0;
    …
    # $all_families_all{reflection}{orig} contains all the inflections of an English word (in case it's "adjective + cats" instead of "adjective + cat"), including the original form.
    %all_families_all = %{$w_family{$hw}};
    $all = join "|", (keys %all_families_all);
    # hw in front of the plus sign +:
    if(/(<h2>)((?:(?!\b(?:$all)\b)[^<>])*\b(?:$all)\b[^<>]*)([+][^<>]*?)(?=<\/h2)/){
        $to_insert = $2;
        $before = 1;
    }
    # hw after the plus sign +:
    elsif(/(<h2>)((?:(?!\b(?:$all)\b)[^<>])*[+])([^<>]*?\b(?:$all)\b[^<>]*)(?=<\/h2>)/){
        $to_insert = $3;
        $before = 0;
    }elsif(/(<h2>)([^<>\+]*)(?=<\/h2>)/){
        $to_insert = "";
        $before = 0;
    }else{
        $to_insert = "";
        $before = 0;
    }

    s,(<span>)([^<>]*)(?=<\/span>), ($before ? "$1<pl>$to_insert</pl>$2" : "$1$2<pl>$to_insert</pl>"),ge if $to_insert;

}

$n=/<\/>/;

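That said, this data looks very regular, so you might get away with it.
The trick is to use $/ to change the input record separator, so that you can process one chunk at a time.
A simple approach might look like this: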
#!/usr/bin/perl

use strict;
use warnings;

# Set the input record separator
local $/ = "</>\n";

# This now reads a "chunk" at a time
while (<DATA>) {
  # Extract the contents of the <span>
  my ($span) = m|<span>(.+)</span>|;

  # If we have an adjective chunk, then extract the noun after the "+"
  if (m|<h2>adjective \+ (.+)</h2>|) {
    my $noun = $1;
    # Append the noun to each word in the span
    $span =~ s/(\w+)(\W|$)/$1 $noun$2/g;
  }

  # If we have a verb chunk, then extract the verb
  if (m|<h2>(.+) \+ verb</h2>|) {
    my $verb = $1;
    # Add the verb to the words in the span
    $span =~ s/(^|\W)(\w)/$1$verb $2/g;
  }

  # Replace the span text with our new version
  s|<span>.+</span>|<span>$span</span>|;

  print;
}

__DATA__
cat
<h2>adjective + cat</h2>
<span>cute, pretty, small</span>
</>
choose
<h2>choose to + verb</h2>
<span>go, play, study</span>
</>
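
To try the record-separator trick in isolation, here is a minimal sketch that reads the same kind of data from an in-memory filehandle. The sub name `read_records` and the sample string are made up for illustration. Note that `chomp` removes the whole separator held in `$/`, not just a trailing newline, and that `local` restores the previous value of `$/` when the sub returns.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into records terminated by "</>\n".
sub read_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = "</>\n";          # restored automatically when the sub returns
    my @records;
    while ( my $record = <$fh> ) {
        chomp $record;           # strips the trailing "</>\n", not just "\n"
        push @records, $record;
    }
    close $fh;
    return @records;
}

my @records = read_records("cat\n<h2>a</h2>\n</>\nchoose\n<h2>b</h2>\n</>\n");
print scalar(@records), " records\n";
```

Keeping the `local $/` inside a sub or a bare block avoids changing how the rest of the program reads lines.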

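Since this is not a complete HTML document, and you can assume its structure never changes, here is a solution similar to Dave's that handles more kinds of words, not just adjectives and verbs.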
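use strict;
use warnings;
use feature 'say';

{
    local $/ = qq{</>\n};
    while ( my $record = <DATA> ) {
        # each record is a group of words, separated by newlines
        my ( $word, $h2, $span ) = split /\n/, $record;

        # remove the span tags and break into words
        ( my $option_csv ) = $span =~ m/>(.+)…
        …

Does your whole input follow the structure you showed, or is there more to it? In particular, does </> always separate the records?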
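</> separates each dictionary entry.

I love that we both came up with the same idea.

Thanks, @simbabque. I like the idea, especially if the structure does not change. I should add that h2 and span can occur several times within one entry; for example, there could be another "cat + verb" heading with "meow, claw, sleep" in the cat entry. Can this code be adapted?

@jonah You should have said so from the start. Going back and adding extra requirements is impolite. Simplifying is one thing, but leaving things out makes life harder for all of us. So you need code that recognizes which kind of element it is looking at and then processes the elements in a loop. But there are still many open questions. Are there two of each element, or several of one? What happens if there are several of each? How do they match up? Can they be combined in the output? Is the order fixed? Without answers to some of these questions, all I can do is guess.

Sorry, I only just realized that. I will definitely pay more attention to it. To answer your questions: there can be several, and each entry should be combined in the output in its original order. Also, they always come in pairs: h2 span h2 span h2 span.

@jonah See my update. I hope it is self-explanatory.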
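
For the clarified requirement (several h2/span pairs per entry, order preserved), here is a minimal sketch rather than a definitive solution. It assumes that every h2 has the form "LEFT + RIGHT", that the first line of the entry is the headword, that the side containing the headword is the part to keep, and that the output follows the format of Dave's answer (each span item combined with the kept side). The sub name `expand_entry` is hypothetical, and inflected forms of the headword are not looked up.

```perl
use strict;
use warnings;

# Hypothetical helper: expand every <h2>/<span> pair in one entry, in place.
sub expand_entry {
    my ($entry) = @_;

    # Assumption: the first non-blank token of the entry is the headword.
    my ($headword) = $entry =~ /\A\s*(\S+)/;
    return $entry unless defined $headword;

    # Each match covers one <h2>…</h2> and the <span>…</span> that follows it,
    # so the pairs are rewritten one by one and their order is unchanged.
    $entry =~ s{(<h2>([^<>+]*?)\s*\+\s*([^<>]*?)</h2>\s*<span>)([^<>]*)(</span>)}{
        my ( $open, $left, $right, $items, $close ) = ( $1, $2, $3, $4, $5 );
        # If the headword is left of the "+", keep the left side;
        # otherwise fall back to keeping the right side.
        my $head_is_left = $left =~ /\b\Q$headword\E\b/;
        my $expanded = join ', ',
            map { $head_is_left ? "$left $_" : "$_ $right" }
            split /\s*,\s*/, $items;
        "$open$expanded$close";
    }ge;

    return $entry;
}

my $entry = "cat\n<h2>adjective + cat</h2>\n<span>cute, pretty</span>\n"
          . "<h2>cat + verb</h2>\n<span>meow, sleep</span>\n</>\n";
print expand_entry($entry);
```

Called once per record inside a loop that reads with `$/` set to "</>\n", this leaves the h2 lines and the separator untouched and only rewrites the text inside each span.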