Perl WWW::Mechanize does not handle apostrophes or dashes


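I have been working on extracting information from Metacritic, but I have run into a problem: I cannot extract clean text when it contains apostrophes or dashes.

The following code illustrates the problem: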

use WWW::Mechanize;
$reviewspage = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews';
$Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.';
$l = WWW::Mechanize->new();
$l->get($reviewspage);
$k = $l->content;
@Review = $k =~ m{$Review.*?<div class="review_body">(.*?)</div>}s;
print "@Review\n";

The extracted text does not come out clean, even though the page source contains:

<div class="review_body">
                                Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly.
                            </div>


I have written similar scripts before, all of them using WWW::Mechanize, and none of them replaced characters like this.

Metacritic uses the UTF-8 character set:

<meta http-equiv="content-type" content="text/html; charset=UTF-8">
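
To make the distinction between UTF-8 bytes and Perl characters concrete, here is a minimal offline sketch using the core Encode module. The sample string is a hypothetical fragment modelled on the review text, not data fetched from Metacritic, and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: the UTF-8 bytes for "won’t — clichéd", written
# with hex escapes so this file itself stays plain ASCII.
my $bytes = "won\xE2\x80\x99t \xE2\x80\x94 clich\xC3\xA9d";

# decode() turns the byte string into a string of Perl characters.
my $chars = decode('UTF-8', $bytes);

# The counts differ because the apostrophe and the dash take three
# bytes each in UTF-8 and the accented e takes two.
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# encode() goes the other way; it is needed before the text is written
# to a handle that has no encoding layer.
print encode('UTF-8', $chars), "\n";
```

A string of decoded characters printed to a handle without an encoding layer is what typically produces mangled apostrophes and dashes, which is why the fixes in this thread all concern the output side.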
The output of the working script that calls binmode on STDOUT is reproduced after that script below.

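This is clearly a Unicode problem.

Following suggestions from elsewhere on handling Unicode in Perl, I was able to get this version of your code working: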

use v5.12 ;
use utf8 ;                               # string literals in this file are UTF-8
use open qw( :encoding(UTF-8) :std ) ;   # UTF-8 I/O layers, standard handles included

use WWW::Mechanize ;

my $reviewspage = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews' ;
my $Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.' ;
my $l = WWW::Mechanize->new() ;
$l->get($reviewspage) ;
my $k = $l->content ;                    # page content as returned by Mechanize
my @Review = $k =~ m{$Review.*?<div class="review_body">(.*?)</div>}s ;
print "@Review\n" ;

A second working version sets the output encoding with binmode instead:

use strict;
use warnings;
use utf8;

use WWW::Mechanize;

binmode STDOUT, ':utf8';   # output should be in UTF-8

my $url = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews';
my $Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.';

my $lwp = WWW::Mechanize->new();
$lwp->get($url);
my $data = $lwp->content;

if ($data =~ m{$Review.*?<div class="review_body">(.*?)</div>}s) {
    print "$1\n";
} else {
    warn "Review not found";
}

Output (line breaks added):

Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins
and others won’t persuade you that Death could have been huge, nor does a
clichéd last-act reunion show. But the film’s alternating inquiry — into
family love, slow compromise and, yes, death — resonates strongly.
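
Both working versions rely on the same idea: the output handle gets a UTF-8 encoding layer, so characters are converted to bytes as they are written. A minimal sketch of that mechanism, with no network access, is shown here; an in-memory file handle stands in for STDOUT, the sample string is a hypothetical fragment modelled on the review, and the file is assumed to be saved as UTF-8 because of `use utf8`.

```perl
use strict;
use warnings;
use utf8;    # the string literal below contains non-ASCII characters

my $text = "won’t — clichéd";

# An in-memory file handle with a UTF-8 encoding layer stands in for
# STDOUT after binmode(STDOUT, ':encoding(UTF-8)').
my $buffer = '';
open(my $fh, '>:encoding(UTF-8)', \$buffer)
    or die "Cannot open in-memory handle: $!";
print {$fh} $text;
close($fh);

# $buffer now holds the encoded bytes that the layer produced.
printf "characters in: %d, bytes out: %d\n", length($text), length($buffer);
```

The layer does the encoding once, at the boundary, so the rest of the script can work with character strings and regular expressions without calling encode by hand.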