Scraping an image from HTML page source using PHP
I have a function that scrapes images from an HTML web page. This is the HTML source I want to scrape:
<div class="single-post-thumb">
<img width="448" height="298" src="http://www.website.com/wp-content/uploads/2015/02/DSC_2803.jpg" class="attachment-660x330" alt="Description image" title="Description title" /> </div>
This is my scraping function:
public function process_individual_links($news_coll)
{
    $news_coll = array_reverse($news_coll);
    //print_r($news_coll);
    foreach ($news_coll as $news)
    {
        $news_url = $news["news_url"];
        $preview = $this->_http->request($news_url);
        $preview = $this->stripNewLine($preview);
        // Group 1 is the image src, group 2 is the entry content.
        $expr = '#<div class="single-post-thumb"><img .*? src="(.*?)".*?/></div>.*?<div class="entry">(.*?)</div>#';
        preg_match_all($expr, $preview, $matches);
        $count = count($matches[0]);
        if ($count == 0)
        {
            // No thumbnail matched: fall back to the entry content only.
            $expr = '#<div class="entry">(.*?)</div><!-- .entry /-->#';
            preg_match_all($expr, $preview, $matches);
            $news["news_images"] = "";
            $news["news_content"] = $matches[1][0];
        }
        else
        {
            $news["news_images"] = str_replace('"', "", $matches[1][0]);
            $news["news_content"] = $matches[2][0];
            echo " $news[news_images] ";
        }
        $imager = $news["news_images"];
        $news["news_content"] = $news["news_content"] . "<p><a href='" . $news_url . "'>Sumber Berita</a></p>" . $imager;
        if ($this->insertIntoWordpress($news, "TNI") == "-1")
            echo " ";
        else
            echo "Fetching Content - " . $news["news_url"] . "" . $news["news_images"] . "";
    }
}
I tried it on another website and it works there; on that site the img tag has no height and width before src.
I use this expression to scrape the code:
$expr = '#<div class="single-post-thumb"><img .*? src="(.*?)".*?/></div>.*?<div class="entry">(.*?)</div>#';
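Add the s modifier (PCRE_DOTALL) after the closing delimiter so that the dot also matches newlines: …
and use \s* for whitespace between the tags.
Also note that <img .*? src requires two spaces (one after img and one before src). If the attribute order differs, change it to … where \s is shorthand for whitespace and \b is a word boundary.

I have tried the code but it still does not work; the markup has width="640" height="330".

So, Andy, match those attributes with something like width="\d*" height="\d*".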

Thanks jonny, it works now.
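
The advice in the answer can be sketched roughly as follows. This is only an illustration under stated assumptions: the sample markup is the snippet from the question, the entry div and its text are hypothetical (the real page's entry markup is not shown), and the exact adjusted pattern is one possible reading of the answer (\s* between tags, \s and \b around the img attributes, and the s modifier), not necessarily the pattern that was finally used.

```php
<?php
// Sample markup from the question. The <div class="entry"> line is a
// hypothetical stand-in, since the real entry markup is not shown.
$html = <<<'HTML'
<div class="single-post-thumb">
<img width="448" height="298" src="http://www.website.com/wp-content/uploads/2015/02/DSC_2803.jpg" class="attachment-660x330" alt="Description image" title="Description title" /> </div>
<div class="entry">Article text</div>
HTML;

// Original expression: it expects <img directly after the opening div
// and </div> directly after />, with no whitespace in between.
$original = '#<div class="single-post-thumb"><img .*? src="(.*?)".*?/></div>.*?<div class="entry">(.*?)</div>#';

// Adjusted expression (an assumption based on the answer):
//  - \s* tolerates whitespace or newlines between the tags
//  - <img\s.*?\bsrc accepts any attributes before src
//  - the trailing s modifier (PCRE_DOTALL) lets the dot match newlines
$adjusted = '#<div class="single-post-thumb">\s*<img\s.*?\bsrc="(.*?)".*?/>\s*</div>.*?<div class="entry">(.*?)</div>#s';

$original_count = preg_match_all($original, $html, $unused);
$adjusted_count = preg_match_all($adjusted, $html, $matches);

// Group 1 is the image URL, group 2 is the entry content.
$image_src  = $adjusted_count > 0 ? $matches[1][0] : null;
$entry_text = $adjusted_count > 0 ? $matches[2][0] : null;

echo "original matches: $original_count\n";
echo "adjusted matches: $adjusted_count\n";
echo "image: $image_src\n";
```

Because the lazy .*? before \bsrc swallows whatever attributes precede src, this form should not need a separate width="\d*" height="\d*" clause, although spelling those attributes out as suggested in the comments works too when their order is fixed.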