PHP: Trying to use DOMDocument::loadHTMLFile with a generated URL


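I am calling the DOMDocument::loadHTMLFile method with a URL that I build myself.

This is the code I use to build the URL:

$url = "http://en.wikipedia.org".$path;

$path is obtained from the href attribute of another file. When I echo it, it prints /wiki/Pop_music

If I hard-code the URL as http://en.wikipedia.org/wiki/Pop_music the page is returned fine, but if I try to use the generated path, I get errors.

This is the code I am currently using: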

    foreach ($paths as $path) 
    {
        echo $path;                                         // will cause error 
        //echo $path = '/wiki/Pop_music';                       // will work
        $url = "http://en.wikipedia.org"."$path";
        $doc = getHTML($url, 1);

        if($doc !== false)
        {
            $xpath = new DOMXPath($doc);
            $xpathCode = "//h1[@id='firstHeading']";
            $nodes = $xpath->query($xpathCode);
            echo $nodes->item(0)->nodeValue."<br />";
        }
    }
The getHTML function is:

function getHTML($url, $domainID)
{
    $conArtistsCrawler = new mysqli(HOST, USERNAME, PASSWORD, CRAWLER_DB_NAME);

    // Load HTML
    $doc = new DOMDocument();
    $isSuccessful = $doc->loadHTMLFile($url);

    // Update the time to show that the domain was crawled.
    $sql = "UPDATE Domain SET LastCrawled = CURRENT_TIMESTAMP() WHERE DomainID = '$domainID'";
    $conArtistsCrawler->query($sql);
    $conArtistsCrawler->close();

    // Delay 1 second after the request to avoid getting BANNED
    sleep(1);

    // Check to see if URL is valid
    if($isSuccessful === false)
    {
        //URL invalid!
        echo "\"".$url."\" is invalid<br>";
        return false;
    }

    return $doc;
}
Output of the code:

With the hard-coded path:

Warning: DOMDocument::loadHTMLFile(): ID protected-icon already defined in , line: 60 in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
/wiki/Pop_music
Warning: DOMDocument::loadHTMLFile(): Tag audio invalid in , line: 225 in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77

Warning: DOMDocument::loadHTMLFile(): Tag source invalid in , line: 225 in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
Pop music

With the path variable:

Warning: DOMDocument::loadHTMLFile(): ID protected-icon already defined in , line: 60 in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
/wiki/Pop_music

Warning: DOMDocument::loadHTMLFile(): failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77

Warning: DOMDocument::loadHTMLFile(): I/O warning : failed to load external entity "" in /Applications/MAMP/htdocs/Assignments/Assignment4/test.php on line 77
"" is invalid

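Short answer:

Well, the error you are getting occurs because $doc is not a DOMDocument object but the boolean false. Because you are suppressing DOMDocument warnings, you have no way of knowing why getHTML() returns false.

So lose the @ operator, check what DOMDocument complains about, and debug from there.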
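
If the raw warnings are too noisy, one alternative to the @ operator (not what the asker's code does, just a sketch) is to let libxml collect its parser diagnostics in memory and inspect them afterwards. The inline $html string below is a stand-in for the remote page so the snippet runs offline, and $messages is a name made up for illustration. This only covers parser diagnostics; a failed HTTP request may still raise an ordinary PHP warning.

```php
<?php
// Collect libxml parse errors in memory instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Assumption: inline markup stands in for the page fetched by loadHTMLFile().
$html = '<html><body><h1 id="firstHeading">Pop music</h1><audio></audio></body></html>';
$loaded = $doc->loadHTML($html);

// Turn each collected LibXMLError into a readable line.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}

// Reset libxml state so later code is unaffected.
libxml_clear_errors();
libxml_use_internal_errors($previous);

var_dump($loaded);
print_r($messages);
```

Nothing is hidden this way: the diagnostics are still available in $messages, they just do not end up mixed into the page output.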

Edit:

"But I'm still not sure why, when I pass in the variable, I get a different result than when I hard-code it. When I echo both path values or URL values, they look exactly the same."

They are certainly not exactly the same. There is a line-break tag (<br>) after Pop_music, which makes the URL invalid.

Long story short, run this script:

$path = '/wiki/Pop_music';
$url = "http://en.wikipedia.org$path";
$doc = new \DOMDocument();
$success = @$doc->loadHTMLFile($url);

if ($success) {
    $xpath = new DOMXPath($doc);
    $xpathCode = "//h1[@id='firstHeading']";
    $nodes = $xpath->query($xpathCode);
    echo $nodes->item(0)->nodeValue."<br />";
}
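
To see why the two values only look identical, here is a minimal sketch. The two strings are made-up stand-ins: $hardCoded is typed by hand and $extracted simulates a path that picked up a trailing <br /> while being collected. A browser renders the tag as a line break, so a plain echo hides it; escaping the output or comparing lengths reveals it. Using strip_tags() and trim() is one possible cleanup, assuming the path should never contain markup.

```php
<?php
// Hypothetical values for illustration only.
$hardCoded = '/wiki/Pop_music';
$extracted = '/wiki/Pop_music<br />';

// Escaping the value makes the hidden tag visible in a browser.
echo htmlspecialchars($extracted), PHP_EOL;

// The strings differ even though they render the same.
var_dump($hardCoded === $extracted);
var_dump(strlen($hardCoded), strlen($extracted));

// Strip markup and surrounding whitespace before building the URL.
$clean = trim(strip_tags($extracted));
$url = 'http://en.wikipedia.org' . $clean;
echo $url, PHP_EOL;
```

var_dump() is usually a better debugging tool than echo here, because it prints the string length alongside the value.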
Note: you can also use

set_error_handler(function($errno, $errstr, $errfile, $errline) {
    //Digest error here
});

$doc = new DOMDocument();
$isSuccessful = $doc->loadHTMLFile($url);

restore_error_handler();

The getHTML function has inconsistent return types

The getHTML function can return either a DOMDocument object or a boolean. That is not a bad thing in itself (internally, PHP does this with many functions), but it means you cannot assume that $doc is an object, because it can be the boolean false. You must therefore test the return value before passing it as an argument to DOMXPath. In fact, that is exactly the error you are getting: you are passing a boolean to DOMXPath instead of a DOMDocument object.

There are a couple of ways to fix this: either throw an exception (or error) in the function, or test the return value before passing it to DOMXPath.

Example:

$doc = getHTML($url, 1);
if ($doc instanceof \DOMDocument) {
    $xpath = new DOMXPath($doc);
}
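
The other option mentioned above, throwing instead of returning false, could be sketched roughly like this. loadDocumentOrFail is a hypothetical helper, not the asker's getHTML: it parses an HTML string rather than fetching a URL so that it runs offline, and it leaves out the database update and the sleep() call.

```php
<?php
// Hypothetical helper: always returns a DOMDocument or throws, never false.
function loadDocumentOrFail(string $html): DOMDocument
{
    if (trim($html) === '') {
        throw new RuntimeException('No HTML to parse');
    }

    // Keep parser diagnostics out of the output while loading.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($ok === false) {
        throw new RuntimeException('HTML could not be parsed');
    }
    return $doc;
}

try {
    $doc = loadDocumentOrFail('<h1 id="firstHeading">Pop music</h1>');
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query("//h1[@id='firstHeading']");
    // Guard against an empty node list before reading nodeValue.
    $heading = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
    echo $heading, PHP_EOL;
} catch (RuntimeException $e) {
    echo 'Skipping: ', $e->getMessage(), PHP_EOL;
}
```

With a single return type the caller no longer needs the instanceof check; the failure path lives in the catch block instead.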

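Comments:

You should remove the @ before $doc->loadHTMLFile($url) and use a proper error handler.

I took out the @ and got some error messages, but I'm still not sure why passing in the variable gives a different result than hard-coding. When I echo the path values or the URL values, they look the same.

There is a line-break tag after Pop_music; see the answer for the detailed response.

I completely forgot about the @ I had added earlier. I removed the @, but I'm still having a problem with the variable I pass in.

Hey, thanks! I just fixed it! I had a line-break tag appended to the path variable.

@Tony yeah =) A small mistake that makes debugging hard. That's why the @ operator is evil =P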
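
As a follow-up to the "proper error handler" suggestion in the comments, one way to fill in the set_error_handler() callback is to convert warnings into exceptions for the duration of a single call. This is only a sketch: callWithWarningsAsExceptions is a made-up name, and trigger_error() stands in for a loadHTMLFile() call against a bad URL so the snippet runs without network access.

```php
<?php
// Convert PHP warnings into exceptions for the duration of one call.
function callWithWarningsAsExceptions(callable $fn)
{
    set_error_handler(
        function (int $errno, string $errstr, string $errfile = '', int $errline = 0) {
            throw new ErrorException($errstr, 0, $errno, $errfile, $errline);
        },
        E_WARNING | E_USER_WARNING
    );

    try {
        return $fn();
    } finally {
        // Always restore the previous handler, even when an exception is thrown.
        restore_error_handler();
    }
}

$caught = null;
try {
    callWithWarningsAsExceptions(function () {
        // Assumption: stand-in for $doc->loadHTMLFile($url) with an invalid URL.
        trigger_error('HTTP request failed', E_USER_WARNING);
    });
} catch (ErrorException $e) {
    $caught = $e->getMessage();
}

echo $caught, PHP_EOL;
```

Unlike @, this keeps the warning text, so a message such as the 400 Bad Request from the question would still reach the caller.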