A little help with Perl HTML parsing

html perl parsing


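I am working on a small Perl program that opens a site, searches for the words "Hail Reports", and returns the information to me. I am very new to Perl, so some of these problems may be easy to solve. First, my code tells me I am using an uninitialized value. Here is what I have: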

#!/usr/bin/perl -w
use LWP::Simple;

my $html = get("http://www.spc.noaa.gov/climo/reports/last3hours.html")
    or die "Could not fetch NWS page.";
$html =~ m{Hail Reports} || die;
my $hail = $1;
print "$hail\n";

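Second, I thought regular expressions would be the easiest way to do what I want, but I am not sure whether they can be used for this. I want my program to search for "Hail Reports" and return the information between "Hail Reports" and "Wind Reports". Is this achievable with a regular expression, or should I use a different approach? Below is the snippet of the page source that I want it to return: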

     <tr><th colspan="8">Hail Reports (<a href="last3hours_hail.csv">CSV</a>)&nbsp;(<a href="last3hours_raw_hail.csv">Raw Hail CSV</a>)(<a href="/faq/#6.10">?</a>)</th></tr> 

#The Data here will change throughout the day so normally there will be more info.
      <tr><td colspan="8" class="highlight" align="center">No reports received</td></tr> 
      <tr><th colspan="8">Wind Reports (<a href="last3hours_wind.csv">CSV</a>)&nbsp;(<a href="last3hours_raw_wind.csv">Raw Wind CSV</a>)(<a href="/faq/#6.10">?</a>)</th></tr>


You are not capturing anything in $1 because no part of your regex is enclosed in parentheses. The following works for me:

#!/usr/bin/perl
use strict;
use warnings;

use LWP::Simple;

my $html = get("http://www.spc.noaa.gov/climo/reports/last3hours.html")
    or die "Could not fetch NWS page.";

$html =~ m{Hail Reports(.*)Wind Reports}s || die; #Parentheses indicate capture group
my $hail = $1; # $1 contains whatever matched in the (.*) part of above regex
print "$hail\n";
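
To make the roles of the /s modifier and of greedy versus non-greedy matching concrete, here is a minimal offline sketch. The $html string is a made-up stand-in for the fetched page (including a repeated "Wind Reports" header that the real page may not have), so treat it as an illustration rather than a description of the NWS markup.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the fetched page, so the example runs offline.
my $html = <<'HTML';
<tr><th>Hail Reports</th></tr>
<tr><td>No reports received</td></tr>
<tr><th>Wind Reports</th></tr>
<tr><td>No reports received</td></tr>
<tr><th>Wind Reports (repeated)</th></tr>
HTML

# Without /s, "." does not match a newline, so the pattern cannot span lines.
my $without_s = $html =~ m{Hail Reports(.*)Wind Reports} ? 1 : 0;

# Greedy (.*) with /s runs on to the LAST "Wind Reports" in the text.
my ($greedy) = $html =~ m{Hail Reports(.*)Wind Reports}s;

# Non-greedy (.*?) with /s stops at the FIRST "Wind Reports".
my ($lazy) = $html =~ m{Hail Reports(.*?)Wind Reports}s;

print "matches without /s: $without_s\n";
printf "greedy captured %d characters\n", length $greedy;
printf "non-greedy captured %d characters\n", length $lazy;
```

If the page only ever contains one "Wind Reports" header, both forms capture the same text; the non-greedy form is the safer choice when that is not guaranteed.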

Parentheses capture strings in a regular expression. Your regex has no parentheses, so $1 is not set to anything. If you had:

$html =~ m{(Hail Reports)} || die;

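then $1 would be set to "Hail Reports" if that string occurs in the $html variable. Since you only want to know whether it matched, you do not really need to capture anything at this point, and you could write something like: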

unless ( $html =~ /Hail Reports/ ) {
  die "No Hail Reports in HTML";
}

To capture what lies between the two strings, you can do something like this:

if ( $html =~ /(?<=Hail Reports)(.*?)(?=Wind Reports)/s ) {
  print "Got $1\n";
}
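
One way to package the idea above so that a failed match never leaves you reading an unset $1 is a small helper that returns undef when the markers are missing. This is only a sketch: text_between is a hypothetical name, and the $sample string is invented for the example rather than taken from the live page.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between the first $start and the
# next $end, or undef when either marker is missing.
sub text_between {
    my ($text, $start, $end) = @_;
    # \Q...\E quotes any regex metacharacters in the markers.
    return $text =~ /\Q$start\E(.*?)\Q$end\E/s ? $1 : undef;
}

# Invented sample input, so the example runs without network access.
my $sample = "<th>Hail Reports</th>\n"
           . "<td>No reports received</td>\n"
           . "<th>Wind Reports</th>\n";

my $found   = text_between($sample, 'Hail Reports',    'Wind Reports');
my $missing = text_between($sample, 'Tornado Reports', 'Wind Reports');

# Check definedness before printing to avoid "uninitialized value" warnings.
print defined $found   ? "Got: $found\n"   : "No hail section\n";
print defined $missing ? "Got: $missing\n" : "No tornado section\n";
```

Checking the return value with defined keeps the "uninitialized value" warning from the question from reappearing when the page layout changes.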

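The uninitialized value warning comes from $1, which is never defined or set anywhere.
For a line-level rather than byte-level "between", you can use:
for (split(/\n/, $html)) {
    # split removes the newlines, so add one back when printing
    print "$_\n" if (/Hail Reports/ .. /Wind Reports/ and !/(?:Hail|Wind) Reports/);
}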
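
The line-level approach relies on Perl's range (flip-flop) operator in scalar context. A small offline sketch may make its behavior easier to see; the rows below are invented sample data, not the real page, and the loop collects the lines into an array instead of printing them directly.

```perl
use strict;
use warnings;

# Invented sample rows standing in for the page source.
my $html = join "\n",
    '<tr><th>Tornado Reports</th></tr>',
    '<tr><td>row A</td></tr>',
    '<tr><th>Hail Reports</th></tr>',
    '<tr><td>row B</td></tr>',
    '<tr><td>row C</td></tr>',
    '<tr><th>Wind Reports</th></tr>',
    '<tr><td>row D</td></tr>';

my @between;
for my $line (split /\n/, $html) {
    # The range operator turns true on the "Hail Reports" line and stays
    # true through the "Wind Reports" line, inclusive.
    next unless $line =~ /Hail Reports/ .. $line =~ /Wind Reports/;
    # Drop the two header lines themselves.
    next if $line =~ /(?:Hail|Wind) Reports/;
    push @between, $line;
}

print "$_\n" for @between;
```

With this sample, only the rows between the two headers (row B and row C) end up in @between.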

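Use single-line and multi-line matching (the /s and /m modifiers; /m only changes how ^ and $ behave, which this pattern does not use). Also, the non-greedy .*? stops at the first "Wind Reports" that follows, which can be a bit faster than a greedy match.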
#!/usr/bin/perl -w

use strict;
use LWP::Simple;

   sub main{
      my $html = get("http://www.spc.noaa.gov/climo/reports/last3hours.html")
                 or die "Could not fetch NWS page.";

      # match single and multiple lines + not greedy
      my ($hail, $between, $wind) = $html =~ m/(Hail Reports)(.*?)(Wind Reports)/sm
                 or die "No Hail/Wind Reports";

      print qq{
               Hail:         $hail
               Wind:         $wind
               Between Text: $between
            };
   }

   main();

Could you try XPath?
Thanks, this solved both problems nicely.
You need the "s" modifier on the regex to match across newlines, i.e. =~ /…/s