Scraping Google's first page of results with PHP


I can scrape the title and URL of each Google search result with the PHP code below. How do I also get the description?

$url  = 'http://www.google.com/search?hl=en&safe=active&tbo=d&site=&source=hp&q=Beautiful+Bangladesh&oq=Beautiful+Bangladesh';
$html = file_get_html($url);

$linkObjs = $html->find('h3.r a');
foreach ($linkObjs as $linkObj) {
    $title = trim($linkObj->plaintext);
    $link  = trim($linkObj->href);

    // if it is not a direct link but url reference found inside it, then extract
    if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&amp;sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
        $link = $matches[1];
    } else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
        continue;
    }

    echo '<p>Title: ' . $title . '<br />';
    echo 'Link: ' . $link . '</p>';
}

The current output is:

Title: Natural Beauties - Bangladesh Photo Gallery
Link: http://www.photo.com.bd/Beauties/

I would like the following output instead:

Title: Natural Beauties - Bangladesh Photo Gallery
Link: http://www.photo.com.bd/Beauties/
Description: photo.com.bd is a website for creative photographers from Bangladesh, mainly for amateur ... Natural-Beauty-of-Bangladesh_Flower · fishing on ... BEAUTY-4.

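What HTML are you parsing? What have you tried in order to parse it, and in what way did that attempt fall short of what you expected? A file name does not answer these questions, nor does it make your problem any clearer. As it stands, you appear to be asking someone to do the work for you, which is not what Stack Overflow is for. If you want someone to add features to your code, you should hire a developer. If you have tried to add the feature yourself and run into a problem, we are happy to help, but you need to describe what you tried and the problem you hit.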

include("simple_html_dom.php");

$in = "Beautiful Bangladesh";
$in = urlencode($in); // encodes spaces as + and escapes other special characters
$url  = 'http://www.google.com/search?hl=en&tbo=d&site=&source=hp&q=' . $in . '&oq=' . $in;

print $url."<br>";

$html = file_get_html($url);

$i=0;
$linkObjs = $html->find('h3.r a'); 
foreach ($linkObjs as $linkObj) {
    $title = trim($linkObj->plaintext);
    $link  = trim($linkObj->href);

    // if it is not a direct link but url reference found inside it, then extract
    if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&amp;sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
        $link = $matches[1];
    } else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
        continue;
    }

    // The description is not a child element of the H3, so we keep a counter
    // and look up the matching span.st by index.
    $descr = $html->find('span.st', $i);
    $i++;
    echo '<p>Title: ' . $title . '<br />';
    echo 'Link: ' . $link . '<br />';
    echo 'Description: ' . $descr . '</p>';
}
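
One caveat with the counter: because `span.st` is looked up by a running index, titles and descriptions can drift apart if some result has no snippet. A minimal sketch of an alternative is shown below. It is not the method used above: it swaps simple_html_dom for PHP's built-in DOMDocument and DOMXPath, and it assumes that each result sits in its own `div.g` container holding both the `h3.r` link and the `span.st` snippet. That container, the sample HTML, and the helper names `extractTargetUrl` and `parseResults` are all hypothetical, and the real page markup may differ or change.

```php
<?php
// Hypothetical sample markup standing in for a results page (an assumption,
// not captured from a real response).
$sampleHtml = <<<HTML
<div class="g">
  <h3 class="r"><a href="/url?q=http://www.photo.com.bd/Beauties/&amp;sa=U">Natural Beauties - Bangladesh Photo Gallery</a></h3>
  <span class="st">photo.com.bd is a website for creative photographers from Bangladesh.</span>
</div>
<div class="g">
  <h3 class="r"><a href="/images?q=Beautiful+Bangladesh">Images for Beautiful Bangladesh</a></h3>
</div>
<div class="g">
  <h3 class="r"><a href="http://example.com/">Example result</a></h3>
</div>
HTML;

// Returns the target URL of a result link, or null if none can be found.
function extractTargetUrl(string $href): ?string
{
    if (preg_match('/^https?:/', $href)) {
        return $href; // already a direct link
    }
    // Redirect-style link: look for the real URL in the "q" query parameter.
    $queryStart = strpos($href, '?');
    if ($queryStart === false) {
        return null;
    }
    parse_str(substr($href, $queryStart + 1), $params);
    $target = $params['q'] ?? '';
    return (is_string($target) && preg_match('/^https?:/', $target)) ? $target : null;
}

// Collects title, link, and description per result container, so a missing
// snippet yields an empty description instead of shifting later ones.
function parseResults(string $html): array
{
    libxml_use_internal_errors(true); // tolerate imperfect markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath   = new DOMXPath($doc);
    $results = [];
    foreach ($xpath->query('//div[@class="g"]') as $block) {
        $anchor = $xpath->query('.//h3[@class="r"]/a', $block)->item(0);
        if ($anchor === null) {
            continue; // no result link in this container
        }
        $link = extractTargetUrl(trim($anchor->getAttribute('href')));
        if ($link === null) {
            continue; // skip if it is not a valid link
        }
        $descNode  = $xpath->query('.//span[@class="st"]', $block)->item(0);
        $results[] = [
            'title'       => trim($anchor->textContent),
            'link'        => $link,
            'description' => $descNode ? trim($descNode->textContent) : '',
        ];
    }
    return $results;
}

foreach (parseResults($sampleHtml) as $r) {
    echo "Title: {$r['title']}\n";
    echo "Link: {$r['link']}\n";
    echo "Description: {$r['description']}\n\n";
}
```

With the sample markup, the image-search entry is skipped and the last result gets an empty description rather than borrowing one from a neighbour. To try this against a live page, the fetched HTML string would be passed to `parseResults` in place of `$sampleHtml`, with the XPath expressions adjusted to whatever markup is actually returned.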