Regex: Parsing HTML comments with the Perl split function

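I have a split call that splits the strings in a .txt document on whitespace and special characters, and converts them to lowercase, so that I can count the total number of words in the document. I am now trying to extend the regex so that an entire HTML comment (including all the words inside it) is treated as a delimiter, but I can't get the updated regex to work.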

my @words = split /(?:([_\W\s\d]|(<(\w+)>.*<\/\>)))+/, $text;
# count strings
%count = ();
foreach $word (@words) {
    @count{map lc, @keys} =
        map lc, delete @count{@keys = keys %count};
    $count{$word}++;
}
foreach $key (keys %count) {
    print $key, $count{$key};
}

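The first part works fine, but I can't get the second part,

 |(<(\w+).*\/\>)+

to work properly. When the two are used together, the second part does not match correctly and whitespace gets counted as a word. Ideally, the output should split words on whitespace and special characters and also split on HTML comments (in effect ignoring any words between the comment tags).

I'm not sure whether I can use two character classes in a single split call? I'm still getting to grips with regular expressions.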

Since you say you are parsing .txt documents (with embedded HTML comments), you could try Regexp::Grammars. Here is a starting point:

use strict;
use warnings;
use Regexp::Grammars;

my $parser = qr{   
          <nocontext:>
          <words>
          <token: words> (?:(?:<[word]><[separator]>?)|(?:<[separator]><[word]>?))+
          <token: word> <.wordchar>+
          <token: separator> <.comment> | (?:(?:(?!<.comment>)(?!<.wordchar>)).)+
          <token: wordchar> [a-zA-Z]
          <token: comment> \< <.wordchar>+ \> [^<]* \</\>
}sx;

my $fn = 'file.txt';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my $text = do { local $/; <$fh> };
close $fh;

if ($text =~ $parser) {
    for my $word (@{ $/{words}{word} } ) {
        print "'", $word, "'\n";
    }
}
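
If a grammar module is more than you need, a plain split may also do the job. One thing worth knowing is that split returns the text of every capturing group in the pattern as extra fields, which is most likely why whitespace shows up as a "word" with the original pattern. The following is only a sketch: the sample string is made up, and the <name>...</> comment syntax is an assumption taken from the patterns above.

```perl
use strict;
use warnings;

# Hypothetical sample input; the <name>...</> comment syntax is assumed
# from the patterns used in the question and the grammar above.
my $text = "Hello world <note>ignore these words</> Foo_bar 42 hello";

# Capturing groups in a split pattern are returned as extra fields,
# so the separator itself ends up in the result list.
my @with_captures = split /(\s)+/, "a b";    # ('a', ' ', 'b')

# Non-capturing separator: either a whole comment or a single
# non-letter character, repeated so adjacent separators merge into one.
my $sep = qr{(?: <\w+> .*? </> | [\W\d_] )+}xs;

my %count;
for my $word (split $sep, $text) {
    next unless length $word;    # split can yield a leading empty field
    $count{ lc $word }++;        # lowercase each word once, as it is counted
}

print "$_ $count{$_}\n" for sort keys %count;
```

The comment alternative is listed first so that a `<` which opens a comment is consumed together with the comment body, and `.*?` is non-greedy so that two comments in the same file are not merged into one long separator.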

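Parsing HTML with regular expressions is doomed to fail. Please don't do it. Use an HTML parser instead.

Is it possible, though? What if I absolutely need to use a regex?

Using regular expressions to "parse" XML, HTML, etc. Have a look at how to parse an HTML document while keeping the comments, and how to access them.

I believe some of the non-standard extensions to Perl regexes mean that it is possible. But it would be a horrible, huge, unmaintainable regex that would take days to develop and test. It is never "absolutely necessary". There are always other options.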