PHP crawler for wiki gets an error


php parsing web-crawler


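In the code below I am trying to extract content from a website using PHP. When I call getElementByIdAsString('www.abebooks.com/978014318764/Love-Story-Singh-Ravinder-014348769/plp', 'synopsis'), the code works fine.

But when I use the same code to extract content from Wikipedia, it does not work: getElementByIdAsString("", 'Summary')

Below are my code and the exception I get with the latter call. Can someone correct my code so that it extracts Wikipedia content by id?

Thanks in advance.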

<?php


function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $result = curl_exec($ch);


//    var_dump($doc->loadHTMLFile($url)); die;
error_reporting(E_ERROR | E_PARSE);
    if(!$result) {
        throw new Exception("Failed to load $url");
    }
    $doc->loadHTML($result);
    // Obtain the element
    $element = $doc->getElementById($id);

    if(!$element) {
        throw new Exception("An element with id $id was not found");
    }

    if($pretty) {
        $doc->formatOutput = true;
    }

    // Return the string representation of the element
    return $doc->saveXML($element);
}

// Here I am displaying the output in bold text
echo getElementByIdAsString('https://en.wikipedia.org/wiki/A_Brief_History_of_Time', 'Summary');

?>

Your help would be much appreciated :-)

Try adding:

curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
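
The question does not show the exception text, so it is only an assumption that the request itself failed (for example on certificate verification or an unfollowed redirect). A minimal sketch of how the real cause could be surfaced before switching options off: the helper name `fetchHtml` and the user-agent string are hypothetical, and only standard curl functions are used. Note that setting CURLOPT_SSL_VERIFYPEER to false removes protection against man-in-the-middle attacks, so it is better suited to debugging than to production code.

```php
<?php
// Hypothetical helper (not from the original post): fetch a page and
// include curl's own error message when the transfer fails.
function fetchHtml(string $url): string {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow HTTP redirects
    curl_setopt($ch, CURLOPT_USERAGENT, 'ExampleCrawler/0.1'); // placeholder user agent

    $result = curl_exec($ch);
    if ($result === false) {
        // curl_error() describes what went wrong, e.g. a certificate problem
        $message = curl_error($ch);
        throw new RuntimeException("Failed to load $url: $message");
    }
    return $result;
}
```

If the message points at certificate verification, configuring a CA bundle (CURLOPT_CAINFO) is usually a safer fix than disabling the check.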

Update after discussion in the comments:

<?php

function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    $result = curl_exec($ch);

    error_reporting(E_ERROR | E_PARSE);
    if(!$result) {
        throw new Exception("Failed to load $url");
    }
    $doc->loadHTML($result);
    // Obtain the element
    $element = $doc->getElementById($id);

    if(!$element) {
        throw new Exception("An element with id $id was not found");
    }

    if($pretty) {
        $doc->formatOutput = true;
    }

    $output = '';
    $node = $element->parentNode;

    while(true) {
        $node = $node->nextSibling;
        if(!$node) {
            break;
        }
        if($node->nodeName == 'p') {
            $output .= $node->nodeValue;
        }
        if($node->nodeName == 'h2') {
            break;
        }
    }

    return $output;
}

// Display the extracted text
var_dump(getElementByIdAsString('https://en.wikipedia.org/wiki/A_Brief_History_of_Time', 'Summary'));
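
The sibling walk in the updated function can be tried without any network access. The sketch below runs the same idea on an inline HTML sample: the element with the id sits inside an `<h2>`, so `getElementById` alone returns only the heading text, and the paragraphs have to be collected from the heading's following siblings. The helper name `textAfterHeading` and the sample markup are assumptions made for illustration. Live Wikipedia markup may differ from this sample and changes over time.

```php
<?php
// Hypothetical helper: collect the text of the <p> elements that follow the
// heading wrapping the element with the given id, stopping at the next <h2>.
function textAfterHeading(string $html, string $id): string {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $element = $doc->getElementById($id);
    if (!$element) {
        throw new RuntimeException("An element with id $id was not found");
    }

    $output = '';
    // Start at the first sibling after the heading that wraps the element
    for ($node = $element->parentNode->nextSibling; $node; $node = $node->nextSibling) {
        if ($node->nodeName === 'h2') {
            break;                      // next section reached
        }
        if ($node->nodeName === 'p') {
            $output .= trim($node->nodeValue) . "\n";
        }
    }
    return $output;
}

// Sample markup (an assumption, not copied from Wikipedia)
$html = <<<HTML
<!DOCTYPE html>
<html><body>
<h2><span id="Summary">Summary</span></h2>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<h2><span id="Reception">Reception</span></h2>
<p>Not part of the summary.</p>
</body></html>
HTML;

echo textAfterHeading($html, 'Summary');
```

On this sample the call prints the two paragraphs under the first heading and skips the one after the second heading.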


Liszka, this time it returns no error, but I get a blank page with no content. When I run this code, is there any way to pull out the specific content without using the id, so that I get the "Summary"?

So basically I think it works correctly, since the function you are using is getElementById (so it has the same effect as using $("#Summary") in the Chrome console). What are you trying to achieve?

Maybe try var_dump on the output instead of echoing it? var_dump(getElementByIdAsString('', 'Summary'));

I just want to extract the text under the Summary tab.

Great, man :-) Thanks a lot :-)

No problem, my pleasure :-)