PHP cURL scraping

php curl web-scraping

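I'm trying to fetch some information from a few websites with PHP cURL. The problem is that the content it returns differs from what I get when I open the same page in a normal browser.

An example site is as follows:

I'm trying to get the meta tags, which the browser returns as: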

<meta name="title" content="Razmere v Preboldu se umirjajo" />
<meta name="description" content="Za prebivalci Prebolda je nemirna no&#269;, ki ji je sledilo jutro s &#353;e dodatnimi padavinami..." />
<link rel="image_src" href="http://web.vecer.com/portali/podatki/2010/09/19/slike/online_Prebold0-100.jpg" />
<link rel="target_url" href="http://web.vecer.com/portali/vecer/v1/default.asp?kaj=3&id=2010091905576453" />

What am I doing wrong?

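Hi, the meta tags and any other attributes can be scraped with DOMDocument and DOMXPath, as in the answer code at the end of this page.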


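Are you using the same user agent and the same cookies?

I don't know, I just copied the code from the user agent down to the referer from other examples to make it work.

The web server doesn't seem to send different replies depending on your user agent, so I guess your problem may be something else entirely.

I tried the same user agent (Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0), but it doesn't work. It may be related to cookies, but I don't know whether the cookie.txt setting works. I also tried setting some of the redirect options to true, with no luck.

The problem was that I used set_value('url') from CodeIgniter, which encodes unusual characters in the URL for security reasons. Everything is solved now.

Also, I suggest using Googlebot as the user agent.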
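The fix mentioned in the comments can be sketched roughly as follows. This assumes the escaping was HTML-entity encoding (for example `&` turned into `&amp;`), which the comment does not spell out. The helper name `normalizeScrapeUrl` is hypothetical, and the sample URL is the `target_url` from the question with the ampersand escaped by hand.

```php
<?php
// Hypothetical helper (not from the original post): undo HTML-entity
// escaping that a form helper may have applied to a URL.
function normalizeScrapeUrl(string $escaped): string
{
    return htmlspecialchars_decode($escaped, ENT_QUOTES);
}

// Assumed input: the question's URL after HTML escaping.
$escaped = 'http://web.vecer.com/portali/vecer/v1/default.asp?kaj=3&amp;id=2010091905576453';
$url = normalizeScrapeUrl($escaped);

// Compare the query parameters a server would see in each case.
parse_str(parse_url($escaped, PHP_URL_QUERY), $before);
parse_str(parse_url($url, PHP_URL_QUERY), $after);

var_dump(isset($before['id'])); // the escaped URL has no plain "id" parameter
var_dump($after['id']);         // the decoded URL carries the article id again
```

If the `id` parameter never reaches the server in a usable form, that would be consistent with the empty meta tags and the placeholder id in the cURL output, although the thread does not confirm the exact mechanism.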

Output returned by cURL for the same page (from the question):

<title>VECER.COM: </title>
<meta name="title" content="" />
<meta name="description" content="" />
<link rel="image_src" href="-100.jpg" />
<link rel="target_url" href="http://web.vecer.com/portali/vecer/v1/default.asp?kaj=3&id=1899123000000000">

Code from the question:

function curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // Return the response as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.6 (KHTML, like Gecko) Chrome/16.0.897.0 Safari/535.6');
    // Include the response headers at the start of the returned string
    curl_setopt($ch, CURLOPT_HEADER, true);
    // Read cookies from, and save them to, cookie.txt
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
    // Give up if no connection is established within 30 seconds
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_REFERER, "http://www.windowsphone.com");

    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

Code from the answer:

$target_url = "http://stackoverflow.com/questions";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// Make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);    // treat HTTP codes >= 400 as errors
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
curl_setopt($ch, CURLOPT_AUTOREFERER, true);    // set the Referer header when redirected
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // overall timeout in seconds
$html = curl_exec($ch);

// curl_exec() returns false on failure; an empty body is rejected here as well
if (!$html) {
    echo "<br />cURL error number: " . curl_errno($ch);
    echo "<br />cURL error: " . curl_error($ch);
    exit;
}

// Parse the HTML into a DOMDocument (@ suppresses warnings about malformed markup)
$dom = new DOMDocument();
@$dom->loadHTML($html);

// Grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    //storeLink($url, $target_url);
    echo "<br />Link stored: $url";
}
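The loop above walks the links on the page, while the question asks about meta tags. The same DOMXPath approach can read those as well. The following is only a sketch: the function name `extractMetaTags` is hypothetical, and the sample HTML is a shortened copy of the tags quoted in the question rather than a live response.

```php
<?php
// Hypothetical helper (not from the original post): collect the
// name => content pairs of all <meta name="..."> elements in an HTML string.
function extractMetaTags(string $html): array
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $meta = [];
    foreach ($xpath->query('//meta[@name]') as $node) {
        $meta[$node->getAttribute('name')] = $node->getAttribute('content');
    }
    return $meta;
}

// Sample input built from the tags shown in the question.
$sample = '<html><head>'
    . '<meta name="title" content="Razmere v Preboldu se umirjajo" />'
    . '<link rel="image_src" href="http://web.vecer.com/portali/podatki/2010/09/19/slike/online_Prebold0-100.jpg" />'
    . '</head><body></body></html>';

$tags = extractMetaTags($sample);
echo $tags['title'], PHP_EOL;
```

In a real run, the string returned by `curl_exec()` would take the place of `$sample`. That string must not include response headers, so `CURLOPT_HEADER` should be left off for this use.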