
Scraping an image from HTML page source with PHP


I have a function that scrapes images from an HTML web page. This is the HTML source I want to scrape:

<div class="single-post-thumb">
        <img width="448" height="298" src="http://www.website.com/wp-content/uploads/2015/02/DSC_2803.jpg" class="attachment-660x330" alt="Description image" title="Description title" />      </div>

This is my scraping function:

public function process_individual_links($news_coll)
{       
    echo "Fetching Content - " . $news["news_url"]."". $news["news_images"] . "";   
    $news_coll = array_reverse($news_coll);
    //print_r($news_coll);
    foreach($news_coll as $news)
    {
        $news_url = $news["news_url"];
        $preview = $this->_http->request($news_url);
        $preview = $this->stripNewLine($preview);
        $expr = '#<div class="single-post-thumb"><img .*? src="(.*?)".*?/></div>.*?<div class="entry">(.*?)</div>#';
        preg_match_all($expr, $preview, $matches);
        $count = count($matches[0]) ;
        if($count == 0)
        {
            $expr = '#<div class="entry">(.*?)</div><!-- .entry /-->#';
            $news["news_images"] = str_replace('"', "", $match[1][0]);
            preg_match_all($expr, $preview, $matches);
            $news["news_content"] = $matches[1][0];
        }
        else
        {
            $news["news_images"] = str_replace('"', "", $match[1][0]);
            $news["news_content"] = $matches[2][0];
            echo" $news[news_images] ";
        }
        $imager = str_replace('"', "", $match[1][0]);
        $news["news_content"] = $news["news_content"] . "<p><a href='" . $news_url . "'>Sumber Berita</a></p>".$imager;
        if($this->insertIntoWordpress($news, "TNI") == "-1")                
            echo " ";           
        else                
            echo "Fetching Content - " . $news["news_url"]."". $news["news_images"] . "";
    }
}
The function works on another website, where the img tag has no height and width attributes before src.

I use this expression to scrape the code:

$expr = '#<div class="single-post-thumb"><img .*? src="(.*?)".*?/></div>.*?<div class="entry">(.*?)</div>#';
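
To see the mismatch in isolation, here is a minimal sketch that runs the thumbnail part of the expression against the sample markup. The $html variable is hypothetical and simply holds the snippet from the question; the behaviour of stripNewLine() is not shown in the question, so it is left out here.

```php
<?php
// Sample markup from the question, including the line break and
// indentation between the <div> and the <img>.
$html = '<div class="single-post-thumb">' . "\n"
      . '        <img width="448" height="298" src="http://www.website.com/wp-content/uploads/2015/02/DSC_2803.jpg" class="attachment-660x330" alt="Description image" title="Description title" />      </div>';

// The expression from the question, reduced to the thumbnail part.
$expr = '#<div class="single-post-thumb"><img .*? src="(.*?)".*?/></div>#';

// The pattern expects <img directly after the opening <div>, with no
// whitespace in between, so nothing matches.
$count = preg_match_all($expr, $html, $matches);
var_dump($count); // int(0)
```

Even if the newline were stripped first, the indentation between the two tags would remain, and the pattern would still find nothing.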

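Add the s (PCRE_DOTALL) modifier after the closing delimiter so that the dot also matches newlines: …
Use \s for a whitespace character, and \s* to allow for whitespace between the tags. Also note that img .*? src requires two spaces; if the order is …, change it to … where \s is shorthand for whitespace and \b is a word boundary.

I tried the code, but it still does not work; there is width="640" height="330" in the tag.

Andy, I see. Replace that with something like width="\d*" height="\d*".

Thanks jonny, it's working now.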
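
Putting the suggestions from this thread together, a pattern along these lines might work. This is only a sketch, not the exact expression from the answer: the helper name extract_thumb_src() is hypothetical, and [^>]*? is used here as one way to skip optional attributes such as width and height before src.

```php
<?php
// Hypothetical helper: returns the thumbnail URL, or null if nothing matches.
function extract_thumb_src(string $html): ?string
{
    // \s*      tolerates whitespace (including newlines) between the tags
    // \b       ends the tag name, so "<img" is not followed by a letter
    // [^>]*?   skips any attributes that come before src
    // \ssrc=   needs only one whitespace character before src
    // s        (PCRE_DOTALL) lets the dot match newlines as well
    $expr = '#<div class="single-post-thumb">\s*<img\b[^>]*?\ssrc="(.*?)"[^>]*?/>\s*</div>#s';

    return preg_match($expr, $html, $m) === 1 ? $m[1] : null;
}

// Markup from the question: width and height come before src.
$withSize = '<div class="single-post-thumb">' . "\n"
          . '        <img width="448" height="298" src="http://www.website.com/wp-content/uploads/2015/02/DSC_2803.jpg" class="attachment-660x330" alt="Description image" title="Description title" />      </div>';

// Assumed markup for a page where src is the first attribute.
$withoutSize = '<div class="single-post-thumb"><img src="http://www.website.com/a.jpg" /></div>';

echo extract_thumb_src($withSize), PHP_EOL;
echo extract_thumb_src($withoutSize), PHP_EOL;
```

Because the attributes before src are skipped generically, the same pattern covers both the page with width and height and the page without them.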