Scraping an image from HTML page source with PHP

php html web-scraping

I have a function that scrapes images from an HTML page. This is the HTML source I want to scrape:

<div class="single-post-thumb">
        <img width="448" height="298" src="http://www.website.com/wp-content/uploads/2015/02/DSC_2803.jpg" class="attachment-660x330" alt="Description image" title="Description title" />      </div>

This is my scraping function:

public function process_individual_links($news_coll)
{       
    echo "Fetching Content - " . $news["news_url"]."". $news["news_images"] . "";   
    $news_coll = array_reverse($news_coll);
    //print_r($news_coll);
    foreach($news_coll as $news)
    {
        $news_url = $news["news_url"];
        $preview = $this->_http->request($news_url);
        $preview = $this->stripNewLine($preview);
        $expr = '#<div class="single-post-thumb"><img .*? src="(.*?)".*?/></div>.*?<div class="entry">(.*?)</div>#';
        preg_match_all($expr, $preview, $matches);
        $count = count($matches[0]) ;
        if($count == 0)
        {
            $expr = '#<div class="entry">(.*?)</div><!-- .entry /-->#';
            $news["news_images"] = str_replace('"', "", $match[1][0]);
            preg_match_all($expr, $preview, $matches);
            $news["news_content"] = $matches[1][0];
        }
        else
        {
            $news["news_images"] = str_replace('"', "", $match[1][0]);
            $news["news_content"] = $matches[2][0];
            echo" $news[news_images] ";
        }
        $imager = str_replace('"', "", $match[1][0]);
        $news["news_content"] = $news["news_content"] . "<p><a href='" . $news_url . "'>Sumber Berita</a></p>".$imager;
        if($this->insertIntoWordpress($news, "TNI") == "-1")                
            echo " ";           
        else                
            echo "Fetching Content - " . $news["news_url"]."". $news["news_images"] . "";
    }
}

I tried it on another website and it works there, where the img tag has no height and width before src.

I use this expression to scrape the code:

$expr = '#<div class="single-post-thumb"><img .*? src="(.*?)".*?/></div>.*?<div class="entry">(.*?)</div>#';

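Add the s modifier (PCRE_DOTALL) after the closing delimiter so that the dot also matches newlines, and put \s* between the tags to allow for whitespace there.

Also note that <img .*? src requires two spaces, one after img and one before src. If the order of the attributes can vary, build that part from \s and \b instead, where \s is shorthand for whitespace and \b is a word boundary.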
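The advice above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the answerer's exact pattern: the function name extract_thumb_and_entry is hypothetical, and the "entry" div in the sample is made up so that the second capture group has something to match. The sketch reads from $matches, the array that preg_match fills; the question's code reads from $match in several places, which is never assigned.

```php
<?php
// Hypothetical helper: returns the thumbnail URL and entry text, or null if nothing matches.
function extract_thumb_and_entry(string $html): ?array
{
    // \s* tolerates whitespace between tags; \b and \s avoid depending on exact
    // spacing around the attributes; the trailing "s" lets "." match newlines.
    $expr = '#<div class="single-post-thumb">\s*<img\b[^>]*?\ssrc="(.*?)"[^>]*?/>\s*</div>'
          . '.*?<div class="entry">(.*?)</div>#s';

    if (!preg_match($expr, $html, $matches)) {
        return null;
    }
    return ['image' => $matches[1], 'content' => $matches[2]];
}

// Markup from the question, whitespace included; the "entry" div is an assumed example.
$html = '<div class="single-post-thumb">
        <img width="448" height="298" src="http://www.website.com/wp-content/uploads/2015/02/DSC_2803.jpg" class="attachment-660x330" alt="Description image" title="Description title" />      </div>
<div class="entry">Article text</div>';

print_r(extract_thumb_and_entry($html));
```

Using [^>]* instead of .* inside the tag keeps the match from running past the end of the img element, which matters once the s modifier lets the dot cross line breaks.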
是单词边界。我已经尝试了code但仍然不工作有width=“640”height=“330”所以Andy看到了，用诸如width=“\d*”height=“\d*”
感谢jonny现在的工作。。。。。。
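The width and height suggestion from the comments might look something like this in practice. It is a minimal sketch, assuming the dimensions always come directly before src; the helper name thumb_src_with_dimensions and the sample tags are made up for illustration.

```php
<?php
// Hypothetical helper: returns the src of an <img> whose width and height
// attributes come first, or null when the tag does not have that shape.
function thumb_src_with_dimensions(string $html): ?string
{
    // \d* accepts any digits, so 448x298 and 640x330 both match;
    // \s+ allows one or more whitespace characters between attributes.
    $expr = '#<img\s+width="\d*"\s+height="\d*"\s+src="([^"]*)"#';

    return preg_match($expr, $html, $m) ? $m[1] : null;
}

// Assumed sample tags for illustration.
$with_dimensions = '<img width="640" height="330" src="http://www.website.com/wp-content/uploads/2015/02/DSC_2803.jpg" class="attachment-660x330" />';
$without_dimensions = '<img src="http://www.website.com/a.jpg" />';

var_dump(thumb_src_with_dimensions($with_dimensions));
var_dump(thumb_src_with_dimensions($without_dimensions));
```

This stricter pattern rejects tags without the dimensions, so it would not cover the other website mentioned in the question, where src comes first.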