PHP: extract an href using DOMDocument


php web-scraping


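I am trying to use PHP's DOMDocument to extract an href from a page. An example URL is:

trovaprezzi.it/categoria.aspx?id=-1&libera=frigorifero+lg

The URL I want to extract is the one behind the "Frigoriferi e Congelatori" link. From the source of $url I should get this link: "trovaprezzi.it/prezzo_frigoriferi-congelatori_frigorifero_lg.aspx". The link changes from page to page, though. For example, on the page 'trovaprezzi.it/categoria.aspx?id=-1&libera=lavatrice+lg' I need to extract the first link: "trovaprezzi.it/prezzo_lavatrici-asciugatrici_lavatrice_lg.aspx".

This is a sketch of my code: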

$url = 'http://www.trovaprezzi.it/categoria.aspx?id=-1&libera=frigorifero+lg';
$html = file_get_contents($url);
$dom = new DOMDocument('1.0', 'UTF-8');
$internalErrors = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($internalErrors);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('/html/body/div[@class="catsMI"]/div')->getElementsByTagName('a')->item(0)->getAttribute('href')  ;
echo $nodes;

Thanks in advance for your help.

Update 23/06

A sample of the HTML I want to extract the link from:

<div class="catsMI">
        <div><a title="confronta i prezzi Frigoriferi e Congelatori" href="/prezzo_frigoriferi-congelatori_frigorifero_lg.aspx">Frigoriferi e Congelatori</a><span>(732 prezzi)</span></div>
        <div><a title="confronta i prezzi Ricambi Elettrodomestici" href="/prezzo_ricambi-elettrodomestici_frigorifero_lg.aspx">Ricambi Elettrodomestici</a><span>(191 prezzi)</span></div>
</div>

DOMXPath::query() returns a DOMNodeList, which has no getElementsByTagName() method. Only DOMDocument and DOMElement have that method.
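To make the type distinction concrete, here is a minimal sketch that keeps the asker's two-step approach but calls getElementsByTagName() on a single DOMElement taken from the list, not on the list itself. It assumes the sample markup from the question's update, inlined as a string so no network request is needed. The variable names ($divs, $firstDiv, $firstHref) are illustrative only.

```php
<?php
// Sample markup copied from the question's update (inlined; no HTTP request).
$html = <<<'HTML'
<div class="catsMI">
  <div><a title="confronta i prezzi Frigoriferi e Congelatori" href="/prezzo_frigoriferi-congelatori_frigorifero_lg.aspx">Frigoriferi e Congelatori</a><span>(732 prezzi)</span></div>
  <div><a title="confronta i prezzi Ricambi Elettrodomestici" href="/prezzo_ricambi-elettrodomestici_frigorifero_lg.aspx">Ricambi Elettrodomestici</a><span>(191 prezzi)</span></div>
</div>
HTML;

$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true);  // silence parser warnings
$dom->loadHTML($html);
libxml_use_internal_errors($previous);         // restore the previous setting

$xpath = new DOMXPath($dom);

// query() yields a DOMNodeList: it offers item() and length,
// but not getElementsByTagName().
$divs = $xpath->query('//div[@class="catsMI"]/div');

$firstHref = null;
$firstDiv = $divs->item(0);                    // a single node, or null if the list is empty
if ($firstDiv instanceof DOMElement) {
    // DOMElement does provide getElementsByTagName().
    $links = $firstDiv->getElementsByTagName('a');
    if ($links->length > 0) {
        $firstHref = $links->item(0)->getAttribute('href');
    }
}

echo $firstHref, PHP_EOL;
```

The null and length checks matter on live pages: if the markup changes and nothing matches, item(0) returns null and a chained call would fail with a fatal error.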

I haven't looked at the page you linked (please add a minimal, workable sample of the site's HTML to your question), but try the following:

// search for all <a> elements that have a href attribute
// which are descendants of //div[@class="catsMI"]/div
$nodes = $xpath->query( '//div[@class="catsMI"]/div//a[@href]' );

// check if we found any nodes...
if( $nodes->length > 0 ) {
   // if we did: get href attribute of the first node we found
   $href = $nodes->item( 0 )->getAttribute( 'href' );
   echo $href;
}


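Your question is unclear. What exactly are you asking for help with?

I should extract this link from the source of $url: '', but the link changes; for example, on the page '' I need to extract the first link: ''.

Don't tell me here; please add it to the question.

Thanks for your help. Unfortunately it still doesn't work. I added an HTML sample to the first post.

@JustLazzah OK, I've changed the question. Now try again. If it still doesn't work, test whether the HTML is actually being loaded into the DOMDocument by looking at the output of $dom->saveHTML(). If that works correctly, the only thing I can think of is that catsMI is not the only, exact value of the class attribute. In that case, have a look at the existing answers about this issue.

@JustLazzah OK, glad to hear it. You're welcome. Good luck!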
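The two debugging suggestions from the comments can be sketched as follows. This is an illustration, not a confirmed diagnosis of the live page. The markup is a hypothetical variation of the sample in which the class attribute carries an extra value ("extra"), chosen to show why an exact @class comparison can miss. The token-matching expression uses the standard XPath 1.0 functions contains(), concat() and normalize-space().

```php
<?php
// Hypothetical variation of the question's sample: class has a second value.
$html = '<div class="catsMI extra"><div>'
      . '<a href="/prezzo_frigoriferi-congelatori_frigorifero_lg.aspx">Frigoriferi e Congelatori</a>'
      . '</div></div>';

$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$loaded = $dom->loadHTML($html);
libxml_use_internal_errors($previous);

// Step 1: check that the markup really ended up in the DOMDocument.
$serialized = $dom->saveHTML();
$parsedOk = $loaded && strpos($serialized, 'catsMI') !== false;

$xpath = new DOMXPath($dom);

// Step 2a: an exact comparison fails when class holds more than one value.
$exact = $xpath->query('//div[@class="catsMI"]/div//a[@href]');

// Step 2b: match catsMI as one whitespace-separated token of the class attribute.
$token = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " catsMI ")]/div//a[@href]'
);

$href = $token->length > 0 ? $token->item(0)->getAttribute('href') : null;

printf("parsed: %s, exact matches: %d, token matches: %d\n",
    $parsedOk ? 'yes' : 'no', $exact->length, $token->length);
echo $href, PHP_EOL;
```

Padding the attribute value with spaces before calling contains() avoids false positives on class names that merely include "catsMI" as a substring.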