PHP crawler fails when fetching content from a wiki
Tags: php, parsing, web-crawler

In the code below, I am trying to extract content from a website using PHP. The code works well when I call getElementByIdAsString('www.abebooks.com/978014318764/Love-Story-Singh-Ravinder-014348769/plp', 'synopsis').
But when I use the same code to extract content from Wikipedia, it does not work: getElementByIdAsString('', 'Summary').
Below are my code and the exception I get with the latter call. Can someone correct my code so that it extracts Wikipedia content by id? Thanks in advance.
<?php
function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $result = curl_exec($ch);
    // var_dump($doc->loadHTMLFile($url)); die;
    error_reporting(E_ERROR | E_PARSE);
    if(!$result) {
        throw new Exception("Failed to load $url");
    }
    $doc->loadHTML($result);
    // Obtain the element
    $element = $doc->getElementById($id);
    if(!$element) {
        throw new Exception("An element with id $id was not found");
    }
    if($pretty) {
        $doc->formatOutput = true;
    }
    // Return the string representation of the element
    return $doc->saveXML($element);
}
// Display the extracted element
echo getElementByIdAsString('https://en.wikipedia.org/wiki/A_Brief_History_of_Time', 'Summary');
?>
Your help would be much appreciated :-)

Try adding:
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
Update after discussion in the comments:
<?php
function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Skip TLS certificate verification and follow HTTP redirects
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $result = curl_exec($ch);
    error_reporting(E_ERROR | E_PARSE);
    if(!$result) {
        throw new Exception("Failed to load $url");
    }
    $doc->loadHTML($result);
    // Obtain the element
    $element = $doc->getElementById($id);
    if(!$element) {
        throw new Exception("An element with id $id was not found");
    }
    if($pretty) {
        $doc->formatOutput = true;
    }
    // Start from the element's parent (the heading) and walk its following
    // siblings, collecting <p> text until the next <h2> is reached
    $output = '';
    $node = $element->parentNode;
    while(true) {
        $node = $node->nextSibling;
        if(!$node) {
            break;
        }
        if($node->nodeName == 'p') {
            $output .= $node->nodeValue;
        }
        if($node->nodeName == 'h2') {
            break;
        }
    }
    return $output;
}
// Dump the text collected under the Summary heading
var_dump(getElementByIdAsString('https://en.wikipedia.org/wiki/A_Brief_History_of_Time', 'Summary'));
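The sibling walk in the function above can also be written with XPath, which makes it possible to locate the section by its heading text rather than by an id. The following is only a minimal sketch under stated assumptions: `extractSectionByHeading` and `$sampleHtml` are hypothetical names that do not appear in the original answer, and the sample markup (an `<h2>` followed by sibling `<p>` elements) is an assumed simplification, not Wikipedia's actual page structure. It works on an HTML string, so it can be tried without any network access.

```php
<?php
// Hypothetical helper (not part of the original answer): return the text of
// the <p> elements that follow the <h2> whose text equals $heading, stopping
// at the next <h2>.
function extractSectionByHeading(string $html, string $heading): string
{
    $doc = new DOMDocument();
    // Collect markup warnings internally instead of changing error_reporting().
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumption: $heading contains no single quote, because it is placed
    // directly inside an XPath string literal.
    $nodes = $xpath->query(
        "//h2[normalize-space(.)='" . $heading . "']/following-sibling::*"
    );
    if ($nodes === false) {
        return '';
    }

    $paragraphs = [];
    foreach ($nodes as $node) {
        if ($node->nodeName === 'h2') {
            break; // the next section starts here
        }
        if ($node->nodeName === 'p') {
            $paragraphs[] = trim($node->textContent);
        }
    }
    return implode("\n", $paragraphs);
}

// Assumed sample markup, used only to illustrate the traversal.
$sampleHtml = '<html><body>'
    . '<h2><span id="Summary">Summary</span></h2>'
    . '<p>First paragraph.</p>'
    . '<div>ignored</div>'
    . '<p>Second paragraph.</p>'
    . '<h2>Editions</h2>'
    . '<p>Not part of the summary.</p>'
    . '</body></html>';

echo extractSectionByHeading($sampleHtml, 'Summary'), PHP_EOL;
```

If the real page wraps its headings in extra elements, the XPath expression would need to be adjusted to match; that depends on the page's markup at the time of the request.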
Comments:

@Liszka This time it does not return any error, but what I get is a blank page without any content. Is there any way to pull out specific content without using an id?

When I run this code I get "Summary", so basically I think it works correctly, since the function you are using is getElementById (so it has the same effect as using $("#Summary") in the Chrome console). What do you want to achieve?

Maybe try var_dump on the output instead of echoing it? var_dump(getElementByIdAsString('', 'Summary'));

I just want to extract the text under the Summary tab.

Great man :-) Thanks a lot :-)

No problem, my pleasure :)