Warning: file_get_contents(/data/phpspider/zhask/data//catemap/8/perl/10.json): failed to open stream: No such file or directory in /data/phpspider/zhask/libs/function.php on line 167

Warning: Invalid argument supplied for foreach() in /data/phpspider/zhask/libs/tag.function.php on line 1116

Notice: Undefined index: in /data/phpspider/zhask/libs/function.php on line 180

Warning: array_chunk() expects parameter 1 to be array, null given in /data/phpspider/zhask/libs/function.php on line 181
Perl 为什么HTML::TreeBuilder在输出中显示mojibake/奇怪字符?_Perl_Mojibake - Fatal编程技术网

Perl 为什么HTML::TreeBuilder在输出中显示mojibake/奇怪字符?

Perl 为什么HTML::TreeBuilder在输出中显示mojibake/奇怪字符?,perl,mojibake,Perl,Mojibake,我有一个问题;它在输出中显示mojibake/wird字符 use strict; use WWW::Curl::Easy; use HTML::TreeBuilder; my $cookie_file ='/tmp/pcook'; my $curl = new WWW::Curl::Easy; my $response_body; my $charset = 'utf-8'; $DocOffline::charset = undef; $curl->setopt (CURLOPT_URL

我有一个问题;它在输出中显示mojibake/wird字符

use strict;
use WWW::Curl::Easy;
use HTML::TreeBuilder;
my $cookie_file ='/tmp/pcook';
my $curl = new WWW::Curl::Easy;
my $response_body;
my $charset = 'utf-8';
$DocOffline::charset = undef;
$curl->setopt (CURLOPT_URL, 'http://www.breitbart.com/article.php?id=D9G7CR5O0&show_article=1');
$curl->setopt ( CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.9 (KHTML, like Gecko) Chrome/6.0.400.0 Safari/533.9');
$curl->setopt ( CURLOPT_HEADER, 0);
$curl->setopt ( CURLOPT_FOLLOWLOCATION, 1);
$curl->setopt ( CURLOPT_AUTOREFERER, 1);
$curl->setopt ( CURLOPT_SSL_VERIFYPEER, 0);
$curl->setopt ( CURLOPT_COOKIEFILE, $cookie_file);
$curl->setopt ( CURLOPT_COOKIEJAR, $cookie_file);
$curl->setopt ( CURLOPT_HEADERFUNCTION, \&headerCallback );
open (my $fileb, ">", \$response_body);
$curl->setopt(CURLOPT_WRITEDATA,$fileb);
my $retcode = $curl->perform;
if ($retcode == 0) {
    my $dom_tree = HTML::TreeBuilder->new();
    $dom_tree->ignore_elements(qw(script style));
    $dom_tree->utf8_mode(1);
    $dom_tree->parse($response_body);
    $dom_tree->eof();
    print $dom_tree->as_HTML('<>&', ' ', {});
}
sub headerCallback {
my($data, $pointer) = @_;
$data =~ m/Content-Type:\s*.*;\s*charset=(.*)/;
if ($1) {
    $charset =  $1;
    $charset =~ s/[^a-zA-Z0-9_\-]*//g;
}
return length($data);
}
使用严格;
使用WWW::Curl::Easy;
使用HTML::TreeBuilder;
我的$cookie_文件='/tmp/pcook';
my$curl=new WWW::curl::Easy;
我的身体;
my$charset='utf-8';
$DocOffline::charset=undef;
$curl->setopt(CURLOPT_URL,'http://www.breitbart.com/article.php?id=D9G7CR5O0&show_article=1');
$curl->setopt(CURLOPT_USERAGENT,'Mozilla/5.0(Windows;U;windowsnt 6.1;en-US)AppleWebKit/533.9(KHTML,比如Gecko)Chrome/6.0.400.0 Safari/533.9');
$curl->setopt(CURLOPT_头,0);
$curl->setopt(CURLOPT_FOLLOWLOCATION,1);
$curl->setopt(CURLOPT_AUTOREFERER,1);
$curl->setopt(CURLOPT\u SSL\u VERIFYPEER,0);
$curl->setopt(CURLOPT\u COOKIEFILE,$cookie\u file);
$curl->setopt(CURLOPT\u COOKIEJAR,$cookie\u文件);
$curl->setopt(CURLOPT\u HEADERFUNCTION,\&headerCallback);
打开(我的$fileb,“>”,\$response\u body);
$curl->setopt(CURLOPT_WRITEDATA,$fileb);
my$retcode=$curl->perform;
如果($retcode==0){
我的$dom_tree=HTML::TreeBuilder->new();
$dom_tree->ignore_元素(qw(脚本样式));
$dom_tree->utf8_模式(1);
$dom\u tree->parse($response\u body);
$dom_tree->eof();
打印$dom_tree->as_HTML('&','',{});
}
副标题回拨{
我的($data,$pointer)=@;
$data=~m/内容类型:\s**;\s*字符集=(.*)/;
如果有的话(1美元){
$charset=$1;
$charset=~s/[^a-zA-Z0-9\-]*//g;
}
返回长度($数据);
}

你一整天都没有得到答案,因为你的代码在形式和内容上都很混乱,你甚至懒得把整个程序简化成一个测试用例。Mvangest在问题的评论中也产生了误诊

问题是编写Breitbart CMS的人毫无头绪,他们插入了NCR
(这是一个不可打印的字符,甚至可能是无效字符),而他们本应简单地插入字符
-
U+2014 EM DASH
);毕竟,文档编码被声明为UTF-8。(可以清楚地看到,编码应该是Windows-1252,其中分配了代码点151(十进制)

你可以通过明确的解码/编码步骤来解决他们的无能

use Encode qw(encode decode);
⋮
my $string_representation = $dom_tree->as_HTML('<>&', ' ', {});
my $octets = encode('UTF-8', decode('Windows-1252', $string_representation);
⋮
# send the correct Content-Type header in your CGI program before printing the HTTP body
print $octets;
使用编码qw(Encode-decode);
⋮
我的$string_表示=$dom_tree->as_HTML('&','',{});
my$octets=encode('UTF-8',decode('Windows-1252',$string_表示法);
⋮
#在打印HTTP正文之前,在CGI程序中发送正确的内容类型标头
打印$octets;

您正在打印DOM树,但您的终端可能不支持UTF-8。请尝试将其写入文件,然后使用首先正确显示页面的浏览器读取。我尝试将cgi打印到浏览器,结果相同