Perl: Elements inside <a> using WWW::Mechanize

I am using WWW::Mechanize to extract specific links from an HTML page:

my $mech = WWW::Mechanize->new();
$mech->get($uri);
my @links = $mech->find_all_links( url_regex => qr/cgi-bin/ );
for my $link (@links) {
    # try to get everything sorted out
}

The links look like this:

<a href="[...]"><div><div><span>foo bar</span> I WANT THIS TEXT</div></div></a>

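Using

$link->text

I get

foo bar I WANT THIS TEXT

with no way of telling which part of that text came from the <span> element.

Is there a way to get the raw HTML code instead of the stripped text?

In other words, I need a way to get only I WANT THIS TEXT without knowing the exact text inside the <span> tag.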
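You can't do that using WWW::Mechanize.

In fact, there is very little point in using WWW::Mechanize if you don't want any of its features. If all you are using it for is to fetch a web page, then use LWP::UserAgent instead. WWW::Mechanize is just a subclass of LWP::UserAgent with a lot of additional stuff that you don't want.

Here is an example that builds a parse tree of the HTML and locates the links you want. I have used HTML::TreeBuilder because it is very good at tolerating malformed HTML, in a way similar to modern browsers.

I have been unable to test it, because you haven't provided proper sample data and I'm not inclined to create my own.
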
use strict;
use warnings 'all';
use feature 'say';

use WWW::Mechanize;
use HTML::TreeBuilder;

my $mech = WWW::Mechanize->new;
$mech->get('http://www.example.com/');

my $tree = HTML::TreeBuilder->new_from_content($mech->content);

for my $link ( @{ $tree->extract_links('a') } ) {

    my ($href, $elem, $attr, $tag) = @$link;

    # Exclude non-CGI links
    next unless $href =~ /cgi-bin/;

    # Find all immediate child text nodes and concatenate them
    # References are non-text children
    my $text = join ' ', grep { not ref } $elem->content_list;
    next unless $text =~ /\S/;

    # Trim and consolidate spaces
    $text =~ s/\A\s+|\s+\z//g;
    $text =~ s/\s+/ /g;

    say $text;
}
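
One caveat about the code above: it only looks at text nodes that are immediate children of the <a> element, while in the markup from the question the wanted text sits inside nested <div> elements. Here is a minimal sketch of one way to handle that, assuming the unwanted text is always inside a <span>. The sample HTML, the href values and the name link_texts_without_span are made up for illustration and are not from the question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup modelled on the link in the question. The href values are
# placeholders, because the question elides the real ones.
my $html = <<'HTML';
<html><body>
<a href="/cgi-bin/example"><div><div><span>foo bar</span> I WANT THIS TEXT</div></div></a>
<a href="/other/page"><span>skip</span> not a CGI link</a>
</body></html>
HTML

# Hypothetical helper: return the text of every cgi-bin link,
# ignoring anything inside a <span> descendant.
sub link_texts_without_span {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);
    my @texts;

    for my $anchor ( $tree->look_down( _tag => 'a', href => qr/cgi-bin/ ) ) {

        # Work on a copy so the parsed tree itself is left untouched
        my $copy = $anchor->clone;

        # Drop every <span> descendant together with its text
        $_->delete for $copy->look_down( _tag => 'span' );

        my $text = $copy->as_text;
        $copy->delete;

        # Trim and consolidate spaces
        $text =~ s/\A\s+|\s+\z//g;
        $text =~ s/\s+/ /g;

        push @texts, $text if length $text;
    }

    $tree->delete;
    return @texts;
}

print "$_\n" for link_texts_without_span($html);
```

If you need the raw markup rather than the text, calling as_HTML on the element instead of as_text should give you the serialized HTML of that subtree.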

I believe the things in @links can't do that. According to the module's code, they don't know where they came from. I think you will have to get the HTML of the whole page from Mechanize and use a different parser. That code is where the parser output is turned into WWW::Mechanize::Link objects, and I don't know how you would get around that. I'd suggest using a different parser on $mech->content. Please provide a sample HTML page to test with.
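
The two suggestions above (fetch with plain LWP::UserAgent, and run a different parser over the page content) can be combined roughly as follows. This is only a sketch: fetch_tree is a hypothetical helper name, the 30-second timeout is an arbitrary choice, and decoded_content is used here as the counterpart of $mech->content.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

# Hypothetical helper: fetch a URL with plain LWP::UserAgent and
# return the page as an HTML::TreeBuilder parse tree.
sub fetch_tree {
    my ($url) = @_;

    my $ua       = LWP::UserAgent->new( timeout => 30 );
    my $response = $ua->get($url);

    die "GET $url failed: ", $response->status_line, "\n"
        unless $response->is_success;

    return HTML::TreeBuilder->new_from_content( $response->decoded_content );
}

# Only fetch when a URL is given on the command line
if ( my $url = shift @ARGV ) {
    my $tree = fetch_tree($url);

    # List the href of every link whose URL contains cgi-bin
    for my $anchor ( $tree->look_down( _tag => 'a', href => qr/cgi-bin/ ) ) {
        print $anchor->attr('href'), "\n";
    }

    $tree->delete;
}
```

Once you have the tree, you can walk each <a> element yourself, which is exactly what the WWW::Mechanize::Link objects don't let you do.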