HTML DOM scraping with the Simple HTML DOM PHP class


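In this HTML snippet I am having trouble targeting the "plain text" (the author name).

There will be many of these blocks on one page. I am using the Simple HTML DOM class, which works well and is easy to understand. I am just stuck on how to target my "plain text" (the author name in this demo).

Some time xx:xx AM

Author Name - Institute Name

I need to get 4 values from each "block":

Link/path - grabbed correctly so far

Title - grabbed correctly so far

Author name - this is the one I am having trouble targeting

Institute name - grabbed correctly so far

Here is the PHP I have been using/testing so far: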

foreach($html->find('tbody td a') as $element){
    echo 'LINK: ' . $parsedLink = substr($element->onclick, 13, -17) . '<br>';
    $title = $element->find('strong',0);
    echo 'TITLE: '. $title . '<br>';
    $institute = $element->parent()->last_child();
    echo 'INSTITUTE: '. $institute . '<br>';
    //$author = $element->parent()->find('text');
    $author = $element->parent()->last_child()->prev_sibling();
    echo 'AUTHOR: '. $author . '<br>';
}


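I have tried innertext, outertext, plaintext, text blocks and so on, but I cannot seem to target the "plain text" (innertext?) that comes just before the <em> element (the author name text).

How can I target and get this value/element/text?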

The correct way to target the value/element/text above is as follows:

foreach($html->find('tbody td a[onclick]') as $element){
    $parsedLink = substr($element->onclick, 13, -17);
    $title = $element->find('strong',0);
    $author = $element->parent()->find('text'); // <-- returns array
    $institute = $element->parent()->last_child();
    echo 'LINK: ' . $parsedLink . '<br>';    
    echo 'TITLE: '. $title . '<br>';    
    echo 'AUTHOR: '. $author[2] . '<br>';
    echo 'INSTITUTE: '. $institute . '<br>';     
}
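The index in `$author[2]` depends on how many text nodes come before the author name, so it can shift if the markup changes. As a rough alternative sketch (not the Simple HTML DOM API), PHP's built-in `DOMDocument` and `DOMXPath` can select the text node that directly precedes the `<em>` element. The sample markup and the two-cell row layout below are assumptions modelled on the block described in the question.

```php
<?php
// Sample markup: an assumption modelled on the block described in the question.
$html = '<table><tbody><tr><td>Some time xx:xx AM</td><td>'
      . '<a href="javascript:void(0)" onclick="window.open(\'link-path-url.ext\'); return false;">'
      . '<strong>Some Title</strong></a>&nbsp;&nbsp;'
      . '<img alt="VIDEO" border="0" height="12" src="/images/template/video_icon.jpg" width="12" />'
      . '<br />Author Name<em> - Institute Name</em>'
      . '</td></tr></tbody></table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$rows = [];
// Only cells that contain a link with an onclick attribute are of interest.
foreach ($xpath->query('//tbody//td[a[@onclick]]') as $td) {
    $link   = $xpath->query('a[@onclick]', $td)->item(0);
    $title  = $xpath->query('a/strong', $td)->item(0);
    $em     = $xpath->query('em', $td)->item(0);
    // Nearest text node that precedes <em>: the bare author name.
    $author = $xpath->query('em/preceding-sibling::text()[1]', $td)->item(0);

    // Pull the URL out of window.open('...') instead of relying on fixed offsets.
    preg_match("/window\\.open\\('([^']*)'\\)/", $link->getAttribute('onclick'), $m);

    $rows[] = [
        'link'      => $m[1] ?? null,
        'title'     => $title ? trim($title->textContent) : null,
        'author'    => $author ? trim($author->nodeValue) : null,
        'institute' => $em ? trim($em->textContent, " -") : null,
    ];
}

print_r($rows);
```

Selecting relative to `<em>` keeps the lookup tied to the element the author name sits next to, rather than to its position in the list of text nodes.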


Hope this helps someone.

Thanks

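Personally, I got tired of using the Simple HTML DOM parser. The memory problems of PHP Simple HTML DOM Parser gave me too many headaches, and the issue has been around for too long without a solution to my taste (I know, I know, you just need to free the memory manually, but try doing that with recursive functions...). The truth is that the simplest solution is the best one, so I started using the explode() function, which is enough for 98% of scraping problems (and it is faster, since creating and destroying DOM objects takes time). Try this: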

class Scrap {

    private $link;
    private $title;
    private $institute;
    private $author;
    private $html;

    function __construct($url) {
        $this->html = $this->curlDownload($url);
    }

    private function curlDownload($Url){
        if (!function_exists('curl_init')){
            die('Sorry cURL is not installed!');
        }
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $Url);
        curl_setopt($ch, CURLOPT_REFERER, "http://www.google.com");
        curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $output = curl_exec($ch);
        curl_close($ch);
        return $output;
    }

    function scrapLink() {
        if(empty($this->link)) {
            $link = explode('<td><a href="javascript:void(0)" onclick="window.open(\'', $this->html);
            $link = explode('\')', $link[1]);
            $link = $link[0];
            $this->link = $link;
        }
        return $this->link;
    }

    function scrapTitle() {
        if(empty($this->title)) {
            $title = explode('<td><a href="javascript:void(0)" onclick="window.open(\'link-path-url.ext\'); return false;"><strong>', $this->html);
            $title = explode('</strong>', $title[1]);
            $title = $title[0];
            $this->title = $title;
        }
        return $this->title;
    }

    function scrapInstitute() {
        if(empty($this->institute)) {
            $institute = explode('<td><a href="javascript:void(0)" onclick="window.open(\'link-path-url.ext\'); return false;"><strong>Some Title</strong></a>&nbsp;&nbsp;<img alt="VIDEO" border="0" height="12" src="/images/template/video_icon.jpg" width="12" /><br />Author Name<em> -', $this->html);
            $institute = explode('</em>', $institute[1]); // cut at the closing tag
            $institute = trim($institute[0]);
            $this->institute = $institute;
        }
        return $this->institute;
    }

    function scrapAuthor() {
        if(empty($this->author)) {
            $author = explode('<td><a href="javascript:void(0)" onclick="window.open(\'link-path-url.ext\'); return false;"><strong>Some Title</strong></a>&nbsp;&nbsp;<img alt="VIDEO" border="0" height="12" src="/images/template/video_icon.jpg" width="12" /><br />', $this->html);
            $author = explode('<em>', $author[1]); // the author name ends where <em> begins
            $author = $author[0];
            $this->author = $author;
        }
        return $this->author;
    }

    function scrapAll() {
        $this->scrapLink();
        $this->scrapTitle();
        $this->scrapInstitute();
        $this->scrapAuthor();
        return array($this->link, $this->title, $this->institute, $this->author);
    }
}
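The class above reads a single block and hardcodes that block's link and title inside its delimiters, so it would need changes for a page with many blocks. A minimal sketch of the same explode() idea with generic delimiters might look like the following. The helper names `textBetween` and `scrapeBlocks` and the sample values are hypothetical, and the page is assumed to use exactly the markup shown in this answer.

```php
<?php
// Hypothetical helper: text between $start and $end, or null if either is absent.
function textBetween(string $haystack, string $start, string $end): ?string
{
    $after = explode($start, $haystack, 2);
    if (count($after) < 2) {
        return null;
    }
    $before = explode($end, $after[1], 2);
    return count($before) < 2 ? null : $before[0];
}

// Hypothetical helper: split the page on the marker that opens every block,
// then read the four fields from each chunk using generic delimiters.
function scrapeBlocks(string $html): array
{
    $marker = '<td><a href="javascript:void(0)" onclick="window.open(\'';
    $chunks = explode($marker, $html);
    array_shift($chunks); // discard everything before the first block

    $rows = [];
    foreach ($chunks as $chunk) {
        // Each chunk starts with the link, which ends at the closing quote and parenthesis.
        $link      = strstr($chunk, "')", true);
        $institute = textBetween($chunk, '<em>', '</em>');
        $rows[] = [
            'link'      => $link === false ? null : $link,
            'title'     => textBetween($chunk, '<strong>', '</strong>'),
            'author'    => textBetween($chunk, '<br />', '<em>'),
            'institute' => $institute === null ? null : trim($institute, " -"),
        ];
    }
    return $rows;
}

// Sample page with two blocks; all values are made up for illustration.
$template = '<td><a href="javascript:void(0)" onclick="window.open(\'%s\'); return false;">'
          . '<strong>%s</strong></a>&nbsp;&nbsp;'
          . '<img alt="VIDEO" border="0" height="12" src="/images/template/video_icon.jpg" width="12" />'
          . '<br />%s<em> - %s</em></td>';
$page = '<tr>'
      . sprintf($template, 'link-path-url.ext', 'Some Title', 'Author Name', 'Institute Name')
      . sprintf($template, 'other-path.ext', 'Another Title', 'Second Author', 'Second Institute')
      . '</tr>';

$rows = scrapeBlocks($page);
print_r($rows);
```

Returning null when a delimiter is missing avoids undefined-index notices if the site changes its markup, which is the main fragility of string splitting compared with a DOM parser.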
