PHP DOMXPath->evaluate can't find the div I need

php curl web-scraping domxpath

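Hi everyone,

I'm trying to scrape search results and it has been working, but now I'm stuck.

The markup below has a DIV with class "vsc", and inside it an H3 with class "r". I can get the anchors inside the H3 tags with //h3[@class='r']//a.

My problem is that the table further down also contains an H3 with class "r", and I don't want any of the links inside the table.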

<li class="g">
<div class="vsc" pved="0CD4QkgowAA" bved="0CD8QkQo" sig="m15">
<h3 class="r">
<a href="https://ameriloan.com/" class="l" onmousedown="return          rwt(this,'','','','1','AFQjCNEazKuyTuAyYgnAT3MqI3aJoiAlZw','','0CDwQFjAA',null,event)">
</h3>
<div class="vspib" aria-label="Result details" role="button" tabindex="0">
<div class="s">
</div>
<table cellpadding="0" cellspacing="0" class="nrgt">

Below is the script I use to scrape all the anchors, but retrieving only the H3 anchors inside the "vsc" DIV doesn't work:

function getURL($url)
{
    $ch = curl_init();
    // This allows the script to accept HTTPS certificates "blindly"
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_URL, $url);
    // CURL_HTTP_VERSION_1_1 is a constant, so it must not be quoted
    curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // Follows redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 6);  // follows up to 6 redirects
    $ret = curl_exec($ch);
    return $ret;
}
$i = 0;
$rawKeyword = 'EXAMPLE';
$keyword = str_replace(' ', '+', $rawKeyword);

$url = "http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=".$keyword;

//get the HTML through cURL function
$html = getURL($url);

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all data
$xpath = new DOMXPath($dom);

// XPath eval to get page links and titles 
//$elementContent = $xpath->evaluate("//h3[@class='r']//a");
$elementContent = $xpath->evaluate("//div[@class='vsc']//h3[@class='r']//a");


// Print results
foreach ($elementContent as $content) {
  $i++;
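  // trim() treats its second argument as a set of characters, not a prefix,
  // so strip the "/url?q=" prefix explicitly
  $href = $content->getAttribute('href');
  $clean = (strpos($href, '/url?q=') === 0) ? substr($href, strlen('/url?q=')) : $href;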
  echo '<strong>'.$i.'</strong>: <h3 style=" clear:none !important; font-size:10px; letter-spacing:0.1em; line-height:2.6em; text-transform:uppercase;">'.$content->textContent.'</h3><br/>'.$clean.'<br /><br />';
}

What is wrong with my evaluate query?

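@jdwilemo - You're right, I only want the anchors inside the DIV with class "vsc". Here is more of the table code, which shows another DIV containing an H3 with class "r":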

<table cellpadding="0" cellspacing="0" class="nrgt">
<tbody>
<tr class="mslg">
<td style="vertical-align: top; ">
<div class="sld vsc" pved="0CIYBEJIKMAE" bved="0CIcBEJEK" sig="Q_U">
<span class="tl">
<h3 class="r">
<a href="https://example.com/?page=ent_cs_login" class="l" onmousedown="return rwt(this,'','','','2','AFQjCNEyANjoolNXGFnLVKH3S1j4CO1qQw','','0CIQBEIwQMAE',null,event)">
</h3>
</span>
<div class="vspib" aria-label="Result details" role="button" tabindex="0">
<div class="s">
</div>
</li>

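Everything is wrapped in an 'li' tag, and the table is the last element inside that 'li'. I want the anchors from the DIV, not the anchors from the table at the end of the 'li' element. I hope that clarifies things...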

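If I understand your question correctly, you only want the anchors of the H3 with class "r" that sits under a div with class "vsc", but you are getting multiple H3 nodes back.

If that is correct, you also need to specify the div's class in the query, just as you did for the h3: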

//div[@class='vsc']/h3[@class='r']//a
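
One detail that may matter here, sketched under assumptions: `@class='vsc'` compares the whole attribute string, so a div written as `class="sld vsc"` (as in the table snippet from the question) would not match it. The sample markup and variable names below are made up for illustration; the `contains(concat(...))` expression is a common XPath 1.0 idiom for testing a single class token.

```php
<?php
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

// Hypothetical sample markup, loosely modeled on the snippets in the question.
$html = <<<HTML
<ul>
  <li class="g">
    <div class="vsc"><h3 class="r"><a href="https://example.com/a">A</a></h3></div>
    <div class="sld vsc"><h3 class="r"><a href="https://example.com/b">B</a></h3></div>
  </li>
</ul>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact comparison: matches only when the class attribute is exactly "vsc".
$exact = $xpath->query("//div[@class='vsc']/h3[@class='r']//a");

// Token comparison: matches any div whose class list contains the token "vsc".
$token = $xpath->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' vsc ')]/h3[@class='r']//a"
);

echo $exact->length, "\n"; // 1
echo $token->length, "\n"; // 2
```

Which of the two forms is appropriate depends on whether the extra classes identify elements you want to include or exclude.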

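If that's not the case, please update your question with more detail and a broader XML sample containing the ambiguous data you mean, and I'll refine my answer. I hope this helps.

A side note: a leading "//" tells XPath to search from the document root through all descendants, so the XPath //h3 returns every node named "h3" in the document.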
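
The difference between a single `/` step (direct children only) and `//` (descendants at any depth) can be sketched as follows. The markup is hypothetical; it places a wrapper element between the div and the h3, similar in spirit to the table snippet in the question.

```php
<?php
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

// Hypothetical markup: one h3 nested two levels below the "vsc" div,
// and one h3 outside of it.
$html = '<div class="vsc"><div class="tl"><h3 class="r">'
      . '<a href="https://example.com/">x</a></h3></div></div>'
      . '<h3 class="r"><a href="https://example.org/">y</a></h3>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// "//h3" selects every h3 in the document, wherever it sits.
$all = $xpath->query("//h3");

// "/h3" after the div selects only h3 elements that are direct children of it.
$children = $xpath->query("//div[@class='vsc']/h3");

// "//h3" after the div selects h3 elements at any depth below it.
$descendants = $xpath->query("//div[@class='vsc']//h3");

echo $all->length, "\n";         // 2
echo $children->length, "\n";    // 0
echo $descendants->length, "\n"; // 1
```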

Edit: If you want the anchors that are inside the div but not inside the table element, just use the ancestor axis like this:

//h3[@class='r' and not(ancestor::table)]//a
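
As a minimal sketch of how this filter behaves, the following runs the query against a small hand-written document. The markup, URLs, and the `$hrefs` variable are assumptions for illustration only, not the real page from the question.

```php
<?php
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

// Hypothetical markup: one result in a plain div, one nested inside a table.
$html = <<<HTML
<ul>
  <li class="g">
    <div class="vsc">
      <h3 class="r"><a href="https://example.com/main">Main result</a></h3>
    </div>
    <table class="nrgt">
      <tbody><tr><td>
        <div class="sld vsc">
          <h3 class="r"><a href="https://example.com/sub">Sub result</a></h3>
        </div>
      </td></tr></tbody>
    </table>
  </li>
</ul>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Keep only anchors whose h3 has no <table> anywhere among its ancestors.
$hrefs = [];
foreach ($xpath->query("//h3[@class='r' and not(ancestor::table)]//a") as $a) {
    $hrefs[] = $a->getAttribute('href');
}

print_r($hrefs); // contains only https://example.com/main
```

Because the predicate looks at ancestors rather than at the div's class, it does not depend on how the class attribute of the enclosing div is written.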

I hope this helps; let me know if I need to clarify anything else!
