HTML DOM scraping with the Simple HTML DOM PHP class


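In this HTML snippet, I'm having trouble targeting the "plain text" (the author name).

I will have many of these blocks on one page. I'm using the Simple HTML DOM class.

Located here:

It works well and is easy to understand. I'm just a bit stuck on how to target my "plain text" (the author name in this demo).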


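Some Time xx:xx am

Author Name - Institution Name

I need to grab 4 values from each "block", as follows:

Link/path - scraped correctly so far

Title - scraped correctly so far

Author name - this is the one I'm having trouble targeting

Institution name - scraped correctly so far

Here is the PHP I've been using/testing so far: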

foreach($html->find('tbody td a') as $element){
    echo 'LINK: ' . $parsedLink = substr($element->onclick, 13, -17) . '<br>';
    $title = $element->find('strong',0);
    echo 'TITLE: '. $title . '<br>';
    $institute = $element->parent()->last_child();
    echo 'INSTITUTE: '. $institute . '<br>';
    //$author = $element->parent()->find('text');
    $author = $element->parent()->last_child()->prev_sibling();
    echo 'AUTHOR: '. $author . '<br>';
}
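I've tried innertext, outertext, plaintext, text blocks, and so on.

But I can't seem to target the "plain text" (innertext?) that comes before the <em> element (the author name text).


How can I target/get this value/element/text?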

The correct way to target the value/element/text above is as follows:

foreach($html->find('tbody td a[onclick]') as $element){
    $parsedLink = substr($element->onclick, 13, -17);
    $title = $element->find('strong',0);
    $author = $element->parent()->find('text'); // <-- returns array
    $institute = $element->parent()->last_child();
    echo 'LINK: ' . $parsedLink . '<br>';    
    echo 'TITLE: '. $title . '<br>';    
    echo 'AUTHOR: '. $author[2] . '<br>';
    echo 'INSTITUTE: '. $institute . '<br>';     
}
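
For anyone who would rather not depend on the text node sitting at index 2, here is a minimal sketch of the same extraction that selects the text node immediately before the <em> element. It uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM. The markup in $html is an assumption, reconstructed from the strings quoted elsewhere in this thread, and the variable names are illustrative only.

```php
<?php
// Assumed sample markup (not the asker's real page).
$html = <<<'HTML'
<table><tbody><tr>
<td>Some Time xx:xx am</td>
<td><a href="javascript:void(0)" onclick="window.open('link-path-url.ext'); return false;"><strong>Some Title</strong></a>&nbsp;&nbsp;<img alt="VIDEO" border="0" height="12" src="/images/template/video_icon.jpg" width="12" /><br />Author Name<em> - Institution Name</em></td>
</tr></tbody></table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$rows = [];
foreach ($xpath->query('//td/a[@onclick]') as $a) {
    $td = $a->parentNode;

    // Same offsets as in the thread: strip "window.open('" and "'); return false;".
    $link = substr($a->getAttribute('onclick'), 13, -17);

    $strong = $xpath->query('strong', $a)->item(0);
    $em     = $xpath->query('em', $td)->item(0);
    // Nearest text node that precedes <em> among the children of the <td>.
    $authorNode = $xpath->query('em/preceding-sibling::text()[1]', $td)->item(0);

    $rows[] = [
        'link'      => $link,
        'title'     => $strong ? trim($strong->textContent) : null,
        'author'    => $authorNode ? trim($authorNode->textContent) : null,
        // Drop the leading " - " separator kept inside <em>.
        'institute' => $em ? ltrim(trim($em->textContent), ' -') : null,
    ];
}

print_r($rows);
```

Selecting relative to <em> keeps working if extra text nodes appear earlier in the cell, which is the main reason to prefer it over a fixed index.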
Hope it helps someone.


Thanks

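Personally, I got tired of using Simple HTML DOM Parser. Its memory problems gave me too many headaches, and they have been around for too long without a fix to my taste (I know, I know, you just need to free the memory manually, but try doing that with a recursive function...). The truth is that the simplest solution is the best one, so I started using the explode() function. It is enough to solve 98% of scraping problems, and it is faster, since creating and destroying DOM objects takes time. Try this: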

class Scrap {

    private $link;
    private $title;
    private $institute;
    private $author;
    private $html;

    function __construct($url) {
        $this->html = $this->curlDownload($url);
    }

    private function curlDownload($Url){
        if (!function_exists('curl_init')){
            die('Sorry cURL is not installed!');
        }
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $Url);
        curl_setopt($ch, CURLOPT_REFERER, "http://www.google.com");
        curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $output = curl_exec($ch);
        curl_close($ch);
        return $output;
    }

    function scrapLink() {
        if(empty($this->link)) {
            $link = explode('<td><a href="javascript:void(0)" onclick="window.open(\'', $this->html);
            $link = explode('\')', $link[1]);
            $link = $link[0];
            $this->link = $link;
        }
        return $this->link;
    }

    function scrapTitle() {
        if(empty($this->title)) {
            $title = explode('<td><a href="javascript:void(0)" onclick="window.open(\'link-path-url.ext\'); return false;"><strong>', $this->html);
            $title = explode('</strong>', $title[1]);
            $title = $title[0];
            $this->title = $title;
        }
        return $this->title;
    }

    function scrapInstitute() {
        if(empty($this->institute)) {
            $institute = explode('<td><a href="javascript:void(0)" onclick="window.open(\'link-path-url.ext\'); return false;"><strong>Some Title</strong></a>&nbsp;&nbsp;<img alt="VIDEO" border="0" height="12" src="/images/template/video_icon.jpg" width="12" /><br />Author Name<em> -', $this->html);
            $institute = explode('</em>', $institute[1]);
            $institute = trim($institute[0]);
            $this->institute = $institute;
        }
        return $this->institute;
    }

    function scrapAuthor() {
        if(empty($this->author)) {
            $author = explode('<td><a href="javascript:void(0)" onclick="window.open(\'link-path-url.ext\'); return false;"><strong>Some Title</strong></a>&nbsp;&nbsp;<img alt="VIDEO" border="0" height="12" src="/images/template/video_icon.jpg" width="12" /><br />', $this->html);
            $author = explode('<em>', $author[1]);
            $author = $author[0];
            $this->author = $author;
        }
        return $this->author;
    }

    function scrapAll() {
        $this->scrapLink();
        $this->scrapTitle();
        $this->scrapInstitute();
        $this->scrapAuthor();
        return array($this->link, $this->title, $this->institute, $this->author);
    }
}
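
The class above bakes the sample link and title into its delimiters, so it only matches that one block, and it reads index [1] without checking that the delimiter was found. Here is a minimal sketch of the same explode() idea with short delimiters and a guard for missing ones. textBetween() is a hypothetical helper name, not part of any library, the $html string is the assumed sample markup from this thread, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
/**
 * Return the text between the first $start and the next $end,
 * or null when either delimiter is missing.
 * Hypothetical helper for illustration only.
 */
function textBetween(string $haystack, string $start, string $end): ?string
{
    $parts = explode($start, $haystack, 2);
    if (count($parts) < 2) {
        return null; // $start not found
    }
    $tail = explode($end, $parts[1], 2);
    if (count($tail) < 2) {
        return null; // $end not found after $start
    }
    return $tail[0];
}

// Assumed sample block (one table cell).
$html = '<td><a href="javascript:void(0)" onclick="window.open(\'link-path-url.ext\'); return false;">'
      . '<strong>Some Title</strong></a>&nbsp;&nbsp;'
      . '<img alt="VIDEO" border="0" height="12" src="/images/template/video_icon.jpg" width="12" />'
      . '<br />Author Name<em> - Institution Name</em></td>';

$link      = textBetween($html, "window.open('", "')");
$title     = textBetween($html, '<strong>', '</strong>');
$author    = textBetween($html, '<br />', '<em>');
$institute = textBetween($html, '<em>', '</em>');
// Drop the leading " - " separator kept inside <em>.
$institute = $institute === null ? null : ltrim($institute, ' -');

print_r([$link, $title, $author, $institute]);
```

This still handles only the first block in the string. For a page with many blocks, one option is to split the page on the opening of each cell first and run the helper on every piece.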