Web scraping Google Scholar with PHP

php curl web-scraping

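I am trying to get information from a Google Scholar profile page. My idea is to retrieve the list of publications with XPath, but I am failing to download the page. Here is my code. I tried with cURL: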

function get_page($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
    //I tried to change user agent as well
    //curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1;  en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    curl_close($ch);
    return $response;
}

$xpath = get_xpath(get_page($query_url));

Without cURL:

function get_xpath($query_url) {
    $dom = new DOMDocument();
    @$dom->loadHTMLFile($query_url);
    sleep(1);
    return new DOMXpath($dom);
}

$query_url = "https://scholar.google.it/citations?user=p-POZjgAAAAJ&hl=it&cstart=0&pagesize=100";

Fetching it without cURL:

$xpath = get_xpath($query_url);

Then:

$autori=$xpath->query("//tr[1]/td[1]/div[1]");

But $autori is always empty. Any idea?
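One detail in the code above is worth checking: get_xpath() hands its argument to loadHTMLFile(), which expects a path or URL, whereas get_page() returns the page body as a string. The sketch below shows one way to parse an HTML string instead, collecting libxml's parse warnings rather than hiding them. The helper name load_html_string() and the sample markup are assumptions made for illustration; they are not part of the original post.

```php
<?php
// Hypothetical helper (not from the original post): parse an HTML *string*
// and return a DOMXPath plus the parser warnings, without using @.
function load_html_string(string $html): array
{
    if (trim($html) === '') {
        throw new InvalidArgumentException('Empty HTML: the download probably failed.');
    }

    // Ask libxml to collect parse problems instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html); // loadHTML() takes markup; loadHTMLFile() takes a path or URL

    // Turn the collected LibXMLError objects into readable strings.
    $warnings = [];
    foreach (libxml_get_errors() as $error) {
        $warnings[] = trim($error->message) . ' (line ' . $error->line . ')';
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the previous setting

    return ['xpath' => new DOMXPath($dom), 'warnings' => $warnings];
}

// Usage with an inline sample document (made up for illustration).
$sample = '<html><body><table><tr><td><div>First cell</div></td></tr></table></body></html>';
$result = load_html_string($sample);
$nodes  = $result['xpath']->query('//tr[1]/td[1]/div[1]');

foreach ($nodes as $node) {
    echo trim($node->textContent), PHP_EOL;
}
```

With this shape, an empty download fails loudly instead of producing an empty DOMXPath, and malformed-markup messages such as "Unexpected end tag" end up in the warnings list where they can be inspected.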

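You can test your XPath query with a browser plugin. Check whether your XPath is valid; then you can tell whether the problem lies in the XPath or in cURL.

Never use @ to suppress errors. It is the equivalent of stuffing your fingers in your ears and going "la la la, can't hear you". Also, you are simply assuming that your scrape actually succeeded. Have you done any basic debugging, like checking whether the URL can be scraped at all? echo file_get_contents($query_url)? Have you checked that you are getting the page you think you should be getting?

The XPath is valid, I have checked it. Thanks, @ElvinValiev.

@MarcB Of course I tried that. As I said, I cannot download the page, because file_get_contents always gives me a server error. I thought it could be scraped because I scraped another Google Scholar page (though not a profile page...). I suppressed the error because it always gives me this: "Warning: DOMDocument::loadHTMLFile(): Unexpected end tag : table in, line: 19 in C:\xampp\htdocs\GAD\numberOfCommonPapers.php on line 1"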
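The debugging step suggested in the comments can be made more explicit by recording what cURL actually got back before any parsing is attempted. The following is a minimal sketch, not a fix for the server error itself: fetch_page() and describe_fetch() are hypothetical helper names, and the timeout and user-agent values are placeholders. If the site rejects automated requests, this only makes the failure visible.

```php
<?php
// Hypothetical helper: fetch a URL and keep the diagnostics cURL provides.
function fetch_page(string $url): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_CONNECTTIMEOUT => 5,    // placeholder value, in seconds
        CURLOPT_TIMEOUT        => 15,   // placeholder value, in seconds
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; example-fetcher)', // placeholder
    ]);

    $body   = curl_exec($ch);                        // false on transport failure
    $error  = curl_error($ch);                       // '' when the transfer succeeded
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 if no HTTP response arrived

    return [
        'status' => $status,
        'error'  => $error,
        'body'   => $body === false ? null : $body,
    ];
}

// Hypothetical helper: summarize a fetch_page() result in one line.
function describe_fetch(array $result): string
{
    if ($result['body'] === null) {
        return 'Transfer failed: ' . $result['error'];
    }
    if ($result['status'] >= 400) {
        return 'HTTP error ' . $result['status'] . ' (' . strlen($result['body']) . ' bytes)';
    }
    return 'HTTP ' . $result['status'] . ', ' . strlen($result['body']) . ' bytes received';
}

// Run from the command line as: php fetch.php <url>
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    echo describe_fetch(fetch_page($argv[1])), PHP_EOL;
}
```

Checking the status line first separates the two questions raised in the thread: whether the page was downloaded at all, and whether the XPath matches the page that came back.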