How to iterate over 300 pages with the Perl::Mechanize parser?

perl parsing

I have written a small parser that extracts data from a page:

use strict; 
use warnings FATAL => qw#all#; 
use LWP::UserAgent; 
use HTML::TreeBuilder::XPath; 
use Data::Dumper; 

my $handler_relurl      = sub { q#https://europa.eu# . $_[0] }; 
my $handler_trim        = sub { $_[0] =~ s#^\s*(.+?)\s*$#$1#r }; 
my $handler_val         = sub { $_[0] =~ s#^[^:]+:\s*##r }; 
my $handler_split       = sub { [ split $_[0], $_[1] ] }; 
my $handler_split_colon = sub { $handler_split->( qr#; #, $_[0] ) }; 
my $handler_split_comma = sub { $handler_split->( qr#, #, $_[0] ) }; 

my $conf = 
{ 
    url      => q#https://europa.eu/youth/volunteering/evs-organisation_en#, 
    parent   => q#//div[@class="vp ey_block block-is-flex"]#, 
    children => 
    { 
        internal_url => [ q#//a/@href#, [ $handler_relurl ] ], 
        external_url => [ q#//i[@class="fa fa-external-link fa-lg"]/parent::p//a/@href#, [ $handler_trim ] ], 
        title        => [ q#//h4# ], 
        topics       => [ q#//div[@class="org_cord"]#, [ $handler_val, $handler_split_colon ] ], 
        location     => [ q#//i[@class="fa fa-location-arrow fa-lg"]/parent::p#, [ $handler_trim ] ], 
        hand         => [ q#//i[@class="fa fa-hand-o-right fa-lg"]/parent::p#, [ $handler_trim, $handler_split_comma ] ], 
        pic_number   => [ q#//p[contains(.,'PIC no')]#, [ $handler_val ] ], 
    } 
}; 

print Dumper browse( $conf ); 

sub browse 
{ 
    my $conf = shift; 

    my $ref = [ ]; 

    my $lwp_useragent = LWP::UserAgent->new( agent => q#IE 6#, timeout => 10 ); 
    my $response = $lwp_useragent->get( $conf->{url} ); 
    die $response->status_line unless $response->is_success; 
    my $content = $response->decoded_content; 

    my $html_treebuilder_xpath = HTML::TreeBuilder::XPath->new_from_content( $content ); 
    my @nodes = $html_treebuilder_xpath->findnodes( $conf->{parent} ); 
    for my $node ( @nodes ) 
    { 
        push @$ref, { };  

        while ( my ( $key, $val ) = each %{$conf->{children}} ) 
        { 
            my $xpath    = $val->[0]; 
            my $handlers = $val->[1] // [ ]; 

            $val = ($node->findvalues( qq#.$xpath# ))[0] // next; 
            $val = $_->( $val ) for @$handlers; 
            $ref->[-1]->{$key} = $val; 
        } 
    } 

    return $ref; 
}

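At first glance, the problem of scraping from one page to the next can be solved in several ways.

There are page numbers at the bottom of the page; see for example: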

http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=5

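We could take this URL as the base.

If we keep an array of the URLs that need to be visited and load from it, we will reach every page.

Note: there are more than 6000 results, and each page shows 21 small entries, each representing one record, so we need to visit roughly 305 pages. We can increment the page number (as in the URL above) and count up to 305.

Hard-coding the total number of pages is impractical because it may change. We could:
- extract the number of results from the first page, divide it by the number of results per page (21), and round up.
- extract the URL from the "last" link at the bottom of the page, create a URI object, and read the page number from the query string.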
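
The two options above could be sketched roughly as follows. This is only an illustration: the sub names `page_count` and `last_page_index`, the figure of 6400 results, and the sample href are made up here, and the real "last" link would first have to be located in the page with a suitable XPath. It assumes the URI module (normally installed alongside LWP) is available.

```perl
use strict;
use warnings;
use POSIX qw(ceil);
use URI;

# Option 1: total results divided by results per page, rounded up.
sub page_count
{
    my ( $total_results, $per_page ) = @_;
    return ceil( $total_results / $per_page );
}

# Option 2: read the "page" parameter from the href of the "last" link.
sub last_page_index
{
    my ( $href ) = @_;
    my %query = URI->new( $href )->query_form;
    return $query{page};
}

# Hypothetical inputs, for illustration only.
print page_count( 6400, 21 ), "\n";
print last_page_index( '/youth/volunteering/evs-organisation_en?country=&topic=&page=304' ), "\n";
```

If the site numbers its pages from 0, as the `0 .. $last` loop in the question suggests, the last index from option 1 would be the page count minus one.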

Now I think I have to loop over all the pages:

# The query string must start with "?"; "&" only separates further parameters.
my $url_pattern = 'https://europa.eu/youth/volunteering/evs-organisation_en?page=%s'; 

for my $page ( 0 .. $last ) 
{ 
    my $url = sprintf $url_pattern, $page; 

    ... 
}

Alternatively, I could try to build the pagination into $conf, perhaps as an iterator that fetches the next node each time it is called…
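
One possible shape for such an iterator is a closure stored in `$conf`. The sketch below is an assumption about how that might look, not the original design: `make_page_iterator` and the `next_url` key are hypothetical names, it hands out URLs rather than nodes, and the last page index of 2 is arbitrary.

```perl
use strict;
use warnings;

# Returns a closure that yields the next page URL on each call,
# and undef once the last page has been handed out.
sub make_page_iterator
{
    my ( $url_pattern, $last ) = @_;
    my $page = 0;    # assumes the site numbers pages from 0

    return sub
    {
        return if $page > $last;
        return sprintf $url_pattern, $page++;
    };
}

my $conf =
{
    # Hypothetical extra key; the original $conf has no "next_url".
    next_url => make_page_iterator( 'https://europa.eu/youth/volunteering/evs-organisation_en?page=%d', 2 ),
};

while ( defined( my $url = $conf->{next_url}->() ) )
{
    print "$url\n";    # browse() would fetch and parse $url here
}
```

Keeping the page counter inside the closure means the calling code never has to track page numbers itself; it just keeps asking for the next URL until it gets undef.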

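After parsing each page, check whether a next › link exists at the bottom. When you reach page 292 there are no more pages, so you can exit the loop at that point.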
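
A minimal sketch of that "stop when there is no next link" loop, using `last` to leave it. The `crawl` sub and its callback arguments are hypothetical, and the fake page strings stand in for real HTML so the example runs without network access.

```perl
use strict;
use warnings;

# Visits page 0, 1, 2, ... and stops as soon as a page has no "next" link.
# $fetch, $parse and $has_next are code refs supplied by the caller;
# $max_pages is a safety cap in case the "next" check never fails.
sub crawl
{
    my ( $fetch, $parse, $has_next, $max_pages ) = @_;
    my @records;

    PAGE: for my $page ( 0 .. $max_pages - 1 )
    {
        my $html = $fetch->( $page );
        push @records, @{ $parse->( $html ) };

        last PAGE unless $has_next->( $html );    # no "next" link: done
    }

    return \@records;
}

# Stand-in data: three "pages", the last one without a next marker.
my @fake_site = ( 'a b <next>', 'c <next>', 'd' );

my $records = crawl(
    sub { $fake_site[ $_[0] ] },                     # fetch
    sub { [ grep { /^\w$/ } split ' ', $_[0] ] },    # parse
    sub { $_[0] =~ /<next>/ },                       # has_next
    1000,                                            # safety cap
);

print "@$records\n";
```

In the real script, the fetch callback could wrap the LWP request from `browse`, and the has_next callback could run `findnodes` with an XPath for the pager link; that XPath depends on the page markup, which is not shown here.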

So, what is the question? How do I loop over the pages?

Use the last command. It is like the break statement in C (as used in loops): it immediately exits the loop in question. If the LABEL is omitted, the command refers to the innermost enclosing loop. The last EXPR form allows the label name to be computed at run time, and is otherwise identical to last LABEL. The continue block, if any, is not executed:

LINE: while (<STDIN>) {
    last LINE if /^$/;  # exit when done with header
    #...
}

I am trying to incorporate that into my code. Thanks to Daxim.

http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=7
