Parsing domains from an HTML page with Perl

html perl parsing url dns


I have an HTML page that contains URLs such as:

<h3><a href="http://site.com/path/index.php" h="blablabla">
<h3><a href="https://www.site.org/index.php?option=com_content" h="vlavlavla">

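I want to extract the part between "http://" (or "https://") and "/index.php".

The main error in your code is that you call split in scalar context, as in your line:
my $values1 = split('http://', $_);

In scalar context, split returns the number of fields it created, not the fields themselves. See the documentation for split.
In any case, I don't think split is the right tool for this task. If you know that the value you are looking for always lies between "http[s]://" and "/index.php", all you need is a substitution regex inside the loop (you should also be more careful when opening files):
open(my $myfile_fh, ...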
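
The two points above can be sketched as follows. This is a minimal illustration, not the answerer's original code (which is cut off above): the variable names, the in-memory filehandle, and the exact substitution pattern are assumptions made so the example is self-contained.

```perl
use strict;
use warnings;

my $line = '<h3><a href="http://site.com/path/index.php" h="blablabla">';

# Scalar context: split returns the number of fields, not the fields.
my $count = split m{http://}, $line;

# List context (note the parentheses on the left): the fields themselves.
my ($before, $after) = split m{http://}, $line;

print "count=$count\n";
print "after=$after\n";

# Sample input taken from the question; \$html acts as an in-memory "file".
my $html = <<'HTML';
<h3><a href="http://site.com/path/index.php" h="blablabla">
<h3><a href="https://www.site.org/index.php?option=com_content" h="vlavlavla">
HTML

# Three-argument open with an error check.
open(my $fh, '<', \$html) or die "Cannot open input: $!";

my @found;
while (my $row = <$fh>) {
    chomp $row;
    # s/// returns true only when the pattern matched, so lines
    # without a matching URL are skipped.
    push @found, $row if $row =~ s{.*https?://(.*)/index\.php.*}{$1};
}
close $fh;

print "$_\n" for @found;
```

With a real file, the reference `\$html` would be replaced by the file name; the three-argument form and the `or die` check are what "more careful" usually means here.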

Another answer explains why split is not the best solution:

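It returns the number of items in scalar context.
A plain regex is better suited to this task.

However, I don't think line-based input processing is valid for HTML, or that a substitution makes sense here (especially when the pattern looks like .*pattern.*).
Given a URL, we can extract the required information like this: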
if ($url =~ m{^https?://(.+?)/index\.php}s) {  # domain+path now in $1
  say $1;
}
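
As a quick check, the match above can be run against the two sample URLs from the question. The third URL and the host-only pattern are assumptions added for illustration; they are not part of the original answer.

```perl
use strict;
use warnings;
use feature 'say';

my @urls = (
    'http://site.com/path/index.php',
    'https://www.site.org/index.php?option=com_content',
    'ftp://example.invalid/index.php',   # hypothetical URL that must not match
);

my (@parts, @hosts);
for my $url (@urls) {
    # Non-greedy (.+?) stops at the first "/index.php".
    if ($url =~ m{^https?://(.+?)/index\.php}s) {
        push @parts, $1;
    }
    # Variation (assumption): capture only the host, up to the first "/" or ":".
    if ($url =~ m{^https?://([^/:]+)}) {
        push @hosts, $1;
    }
}

say "part: $_" for @parts;
say "host: $_" for @hosts;
```

The first pattern yields domain plus path, as in the answer; the second is only useful if the bare host name is what you need.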

But how do we extract the URLs? I recommend the wonderful Mojolicious suite:
use strict; use warnings;
use feature 'say';
use File::Slurp 'slurp';  # makes it easy to read files.
use Mojo::DOM;  # load the DOM parser explicitly

my $html_file = shift @ARGV;  # take file name from command line

my $dom = Mojo::DOM->new(scalar slurp $html_file);

for my $link ($dom->find('a[href]')->each) {
  say $1 if $link->attr('href') =~ m{^https?://(.+?)/index\.php}s;
}

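The find method accepts CSS selectors (here: all a elements that have an href attribute). The each method flattens the result set into a list that we can loop over.
Because I print to STDOUT, shell redirection can put the output into whatever file you want, e.g.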
$ perl the-script.pl html-with-links.html >only-links.txt

The whole script as a one-liner:
$ perl -Mojo -E'$_->attr("href") =~ m{^https?://(.+?)/index\.php}s and say $1 for x(b("test.html")->slurp)->find("a[href]")->each'

This looks to me like a job for a module.

In general, parsing HTML with regexps is risky.
You are splitting $_ in the line my $values1 = …, but unless you passed something on the command line, this variable has no defined value. You should split something you can identify explicitly, so that you know what the result means.

$_ is set by the while (<...>) line. This is a common Perl idiom.

Hi dms, but I can't save the output to a file. I tried:
open (my $sort, 'tt8-4.txt') or die "can't open $!"; while (<$sort>) { (s{.*http[s]?://(.*)/index\.php.*}{$1}); open (SAVE, "1.txt") or die "$!"; print SAVE "$1\n"; close (SAVE) }
but it doesn't work.

When you open SAVE like that, you are opening it for reading. To open it for writing you need '>'; to append (which is what you want here), use '>>', like this: open($save, '>>', '1.txt'). You should also move the open outside the loop.
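
Putting that advice together, a minimal sketch might look like the following. The file names tt8-4.txt and 1.txt come from the comment above; the temporary directory, the sample input, and the use of a match instead of a substitution are assumptions made so the example runs on its own.

```perl
use strict;
use warnings;
use File::Temp 'tempdir';

# Work in a throwaway directory (assumption, so the sketch is self-contained).
my $dir = tempdir(CLEANUP => 1);
my $in  = "$dir/tt8-4.txt";
my $out = "$dir/1.txt";

# Create sample input from the question's two lines.
open(my $w, '>', $in) or die "Cannot write $in: $!";
print {$w} qq{<h3><a href="http://site.com/path/index.php" h="blablabla">\n};
print {$w} qq{<h3><a href="https://www.site.org/index.php?option=com_content" h="vlavlavla">\n};
close $w or die "Cannot close $in: $!";

# Open both handles once, outside the loop; '>>' appends to the output file.
open(my $sort, '<',  $in)  or die "Cannot open $in: $!";
open(my $save, '>>', $out) or die "Cannot open $out: $!";

while (my $line = <$sort>) {
    # Only write when the pattern matched, so $1 is never stale.
    if ($line =~ m{https?://(.*)/index\.php}) {
        print {$save} "$1\n";
    }
}

close $save or die "Cannot close $out: $!";
close $sort;

# Read the result back to show what was saved.
open(my $check, '<', $out) or die "Cannot open $out: $!";
my @saved = <$check>;
close $check;
print @saved;
```

Opening the output once avoids reopening and closing the file for every input line, and checking the match before printing avoids writing a leftover $1 from an earlier line.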