How can I use multithreading in Perl to check whether web pages exist?
I'm writing a Perl script that takes a list of URLs and checks whether they exist. (Note that I only care whether they exist; I don't care what their contents are.) Here is the important part of the program:
use LWP::Simple qw($ua head);

if (head($url)) {
    $numberAlive++;
}
else {
    $numberDead++;
}
Right now the program works fine, but I'd like it to run faster. Since I assume the slow part is contacting the server for each URL, I'm looking for a way to send requests for the other URLs in the list while waiting for the first response. How can I do that? As far as I can tell, the head routine has no callback that fires once the server responds.

Begin with the familiar front matter:
#! /usr/bin/env perl
use strict;
use warnings;
use 5.10.0; # for // (defined-or)
use IO::Handle;
use IO::Select;
use LWP::Simple;
use POSIX qw/ :sys_wait_h /;
use Socket;
Global constants control program execution:
my $DEBUG = 0;
my $EXIT_COMMAND = "<EXIT>";
my $NJOBS = 10;
To create a worker, we start by creating a socketpair. The parent process will use one end and each worker (child) the other. We disable buffering at both ends and add the parent end to our IO::Select instance. We also note each child's process ID so we can wait for all workers to finish:
sub create_worker {
    my($sel, $kidpid) = @_;

    socketpair my $parent, my $kid, AF_UNIX, SOCK_STREAM, PF_UNSPEC
        or die "$0: socketpair: $!";
    $_->autoflush(1) for $parent, $kid;

    my $pid = fork // die "$0: fork: $!";
    if ($pid) {
        ++$kidpid->{$pid};
        close $kid or die "$0: close: $!";
        $sel->add($parent);
    }
    else {
        close $parent or die "$0: close: $!";
        check_sites $kid;
        die "NOTREACHED";
    }
}
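Each child runs check_sites, which is not shown in the excerpt above. A plausible sketch, reconstructed from how the parent uses it (one URL per line in, one "URL: YES" or "URL: NO" line out, exiting when it reads the exit command), might look like:

```perl
# Assumed implementation of the worker loop; the protocol details are
# inferred from dispatch_jobs and the sample output, not from the original.
sub check_sites {
    my($s) = @_;
    warn "$0 [$$]: waiting for URLs" if $DEBUG;
    while (<$s>) {
        chomp;
        exit 0 if $_ eq $EXIT_COMMAND;   # parent told us we're done
        # Answer over the same socket: one result line per URL.
        print $s "$_: ", (head($_) ? "YES" : "NO"), "\n";
    }
    die "NOTREACHED";
}
```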
To dispatch URLs, the parent grabs as many writable sockets as are available and hands out the same number of URLs from the job queue. Any workers left standing after the job queue empties receive the exit command.

Note that print will fail if the underlying worker has already exited. The parent must ignore SIGPIPE to prevent immediate termination:
sub dispatch_jobs {
    my($sel, $jobs) = @_;
    foreach my $s ($sel->can_write) {
        my $url = @$jobs ? shift @$jobs : $EXIT_COMMAND;
        warn "$0 [$$]: sending '$url' to fd ", fileno $s if $DEBUG;
        print $s $url, "\n" or $sel->remove($s);
    }
}
By the time control reaches read_results, the workers have been created and given work. Now the parent uses can_read to wait for results to arrive from one or more workers. A defined result is an answer from the current worker; an undefined result means the child has exited and closed its end of the socket:
sub read_results {
    my($sel, $results) = @_;
    warn "$0 [$$]: waiting for readers" if $DEBUG;
    foreach my $s ($sel->can_read) {
        warn "$0: [$$]: reading from fd ", fileno $s if $DEBUG;
        if (defined(my $result = <$s>)) {
            chomp $result;
            push @$results, $result;
            warn "$0 [$$]: got '$result' from fd ", fileno $s if $DEBUG;
        }
        else {
            warn "$0 [$$]: eof from fd ", fileno $s if $DEBUG;
            $sel->remove($s);
        }
    }
}
The parent also needs to reap workers as they exit; run_pool then drives the subs above to dispatch all URLs and return all results:
sub reap_workers {
    my($kidpid) = @_;
    while ((my $pid = waitpid -1, WNOHANG) > 0) {
        warn "$0: [$$]: reaped $pid" if $DEBUG;
        delete $kidpid->{$pid};
    }
}
sub run_pool {
    my($n, @jobs) = @_;

    my $sel = IO::Select->new;
    my %kidpid;
    my @results;

    create_worker $sel, \%kidpid for 1 .. $n;

    local $SIG{PIPE} = "IGNORE";  # writes to dead workers will fail

    while (@jobs || keys %kidpid || $sel->handles) {
        dispatch_jobs $sel, \@jobs;
        read_results  $sel, \@results;
        reap_workers  \%kidpid;
    }

    warn "$0 [$$]: returning @results" if $DEBUG;
    @results;
}
With a sample main program
my @jobs = qw(
    bogus
    http://stackoverflow.com/
    http://www.google.com/
    http://www.yahoo.com/
);

my @results = run_pool $NJOBS, @jobs;
print $_, "\n" for @results;
the output is
bogus: NO
http://www.google.com/: YES
http://stackoverflow.com/: YES
http://www.yahoo.com/: YES
The above is a worker-based parallelization (using your choice of threads or processes).
Another option is HTTP::Async:
#!/usr/bin/perl
use strict;
use warnings;

use HTTP::Request;
use HTTP::Async;

my $numberAlive = 0;
my $numberDead  = 0;
my @urls = ('http://www.perl.com', 'http://www.example.xyzzy/foo.html');

my $async = HTTP::Async->new;

# you might want to wrap this in a loop to deal with @urls in batches
foreach my $url (@urls) {
    $async->add( HTTP::Request->new( HEAD => $url ) );
}

while ( my $response = $async->wait_for_next_response ) {
    if ($response->code == 200) { $numberAlive++; }
    else                        { $numberDead++;  }
}

print "$numberAlive Alive, $numberDead Dead\n";
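As the comment in the snippet suggests, for a long URL list you may want to feed the queue in batches instead of adding every request up front. One way to sketch that (the batch size, the placeholder URL list, and the use of total_count to track outstanding requests are assumptions, not part of the original answer):

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Async;

my @urls  = map { "http://example.com/page$_" } 1 .. 100;  # placeholder list
my $async = HTTP::Async->new;
my $BATCH = 20;  # assumed cap on outstanding requests; tune to taste

while (@urls or $async->total_count) {
    # Top the queue up so at most $BATCH requests are outstanding.
    while (@urls and $async->total_count < $BATCH) {
        $async->add( HTTP::Request->new( HEAD => shift @urls ) );
    }
    if ( my $response = $async->wait_for_next_response ) {
        print $response->request->uri, ": ",
              ($response->code == 200 ? "YES" : "NO"), "\n";
    }
}
```

This keeps memory bounded while still overlapping many HEAD requests.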
I strongly suggest using an existing module rather than handling the process pool yourself. Copied from my own post:
use strict;
use warnings;
use feature qw( say );

use threads;  # or: use forks;

use LWP::Simple        qw( head );
use Thread::Queue::Any qw( );

use constant NUM_WORKERS => 10;  # Or whatever.

# The original snippet used @urls without declaring it; fill in your list here.
my @urls = ('http://www.perl.com', 'http://www.example.xyzzy/foo.html');

my $req_q  = Thread::Queue::Any->new();
my $resp_q = Thread::Queue::Any->new();

my @workers;
for (1 .. NUM_WORKERS) {
    push @workers, async {
        while (my $url = $req_q->dequeue()) {
            my $is_alive = head($url) ? 1 : 0;
            $resp_q->enqueue($is_alive);
        }
    };
}

$req_q->enqueue($_) for @urls;

my ($alive, $dead);
for (1 .. @urls) {
    my $is_alive = $resp_q->dequeue();
    ++( $is_alive ? $alive : $dead );
}

$req_q->enqueue(undef) for @workers;
$_->join for @workers;

say $alive;
say $dead;
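As written, the counts come out right, but you can't tell which URL produced which answer, since results return in completion order. If you need that pairing, one option (a sketch, not from the original answer) is to enqueue the URL alongside the flag; Thread::Queue::Any can carry arbitrary structures. The worker body and collector loop from the snippet above would change to:

```perl
# Worker body: send back [url, status] pairs instead of bare flags.
push @workers, async {
    while (my $url = $req_q->dequeue()) {
        $resp_q->enqueue([ $url, head($url) ? 1 : 0 ]);
    }
};

# Collector: unpack each pair as it arrives.
for (1 .. @urls) {
    my ($url, $is_alive) = @{ $resp_q->dequeue() };
    say "$url: ", ($is_alive ? "YES" : "NO");
}
```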