PHP: Scraping the HN front page - handling Simple HTML DOM errors


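I am using "Simple HTML DOM" to scrape the front page of HN (news.ycombinator.com), which works well most of the time.

However, every now and then they promote a job or a company, and that entry lacks the elements the scraper looks for: the score, the username, and the number of comments.

Naturally, this breaks the arrays and therefore the script's output: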

 <?php

// 2012-02-12 Maximilian (Extract news.ycombinator.com's Front Page)

// Set the header during development
//header ("content-type: text/xml");

// Call the external PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/manual.htm)
include('lib/simple_html_dom.php');

date_default_timezone_set('Europe/Berlin');

// Download 'news.ycombinator.com' content
//$tmp = file_get_contents('http://news.ycombinator.com');
//file_put_contents('get.tmp', $tmp);

// Retrieve the content
$html = file_get_html('tc.tmp');

// Set the extraction pattern for each item
$title = $html->find("tr td table tr td.title a");
$score = $html->find("tr td.subtext span");
$user = $html->find("tr td.subtext a[href^=user]");
$link = $html->find("tr td table tr td.title a");
$time = $html->find("tr td.subtext");
$additionals = $html->find("tr td.subtext a[href^=item?id]");

// Construct the feed by looping through the items
for($i=0;$i<29;$i++) {

$cr=1;

// Check if the item points to an external website
if (!strstr($link[$i]->href,'http')) {

$url = 'http://news.ycombinator.com/'.$link[$i]->href;
$description = "Join the discussion on Hacker News.";


} else {

$url = $link[$i]->href;

// Getting content here

if (empty($abstract)) {

$description ="Failed to load any relevant content. Please try again later.";

} else {

$description = $abstract;

}

}
// Put all the items together
  $result .= '<item><id>f'.$i.'</id><title>'.htmlspecialchars(trim($title[$i]->plaintext)).'</title><description><![CDATA['.$description.']]></description><pubDate>'.str_replace('  | '.$additionals[$i]->plaintext,'',str_replace($score[$i]->plaintext.' by '.$user[$i]->plaintext.' ','',$time[$i]->plaintext)).'</pubDate><score>'.$score[$i]->plaintext.'</score><user>'.$user[$i]->plaintext.'</user><comments>'.$additionals[$i]->plaintext.'</comments><id>'.substr($additionals[$i]->href,8).'</id><discussion>http://news.ycombinator.com/'.$additionals[$i]->href.'</discussion><link>'.htmlspecialchars($url).'</link></item>'; 
}

$output = '<rss><channel><id>news.ycombinator.com Frontpage</id><buildDate>'.date('Y-m-d H:i:s').'</buildDate>'.$result.'</channel></rss>';

file_put_contents('tc.xml', $output);


?>
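To see why a single promoted entry throws off every item after it, here is a minimal sketch with hypothetical data (the titles and scores are made up, and PHP 7.1+ is assumed). Each `find()` call in the script returns its own flat list, so a row without a score makes the score list one element shorter than the title list, and pairing by index then attaches the wrong score to every later row.

```php
<?php
// Hypothetical data: three front-page rows, the second one a job post
// that has a title but no score.
$titles = ['Story A', 'Job post', 'Story B'];
$scores = ['196 points', '25 points']; // only two matches, not three

$rows = [];
foreach ($titles as $i => $title) {
    // Index-based pairing: from the job post onward, scores slide up one row.
    $rows[] = [$title, $scores[$i] ?? null];
}

foreach ($rows as [$title, $score]) {
    echo $title, ' => ', $score ?? '(none)', "\n";
}
```

The job post ends up with the score that belongs to "Story B", and "Story B" gets none, which is the same kind of shift seen in the script's output.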
Here is an example of correct output:

<item>
<id>f0</id>
<title>Show HN: Bootswatch, free swatches for your Bootstrap site</title>
<description><![CDATA[Easy to Install Simply download the CSS file from the swatch of your choice and replace the one in Bootstrap. No messing around with hex values. Whole New Feel We've all been there with the black bar and blue buttons. See how a splash of color and typography can transform the feel of your site. Modular Changes are contained in just two LESS files, enabling modification and ensuring forward compatibility.]]></description>
<pubDate>3 hours ago</pubDate>
<score>196 points</score>
<user>parkov</user>
<comments>30 comments</comments>
<id>3594540</id>
<discussion>http://news.ycombinator.com/item?id=3594540</discussion>
<link>http://bootswatch.com</link>
</item>
<item>
<id>f1</id>
<title>Louis CK inspires Jim Gaffigan to sell comedy special for $5 online</title>
<description><![CDATA[Dear Internet Friends,Inspired by the brilliant Louis CK, I have decided to debut my all-new hour stand-up special on my website, Jimgaffigan.com.Beginning sometime in April, “Jim Gaffigan: Mr. Universe” will be available exclusively for download for only $5. A dollar from each download will go directly to The Bob Woodruff Foundation; a charity dedicated to serving injured Veterans and their families.I am confident that the low price of my new comedy special and the fact that 20% of each $5 download will be donated to this very noble cause will prevent people from stealing it. Maybe I’m being naïve, but I trust you guys.]]></description>
<pubDate>57 minutes ago</pubDate>
<score>25 points</score>
<user>rkudeshi</user>
<comments>4 comments</comments>
<id>3595285</id>
<discussion>http://news.ycombinator.com/item?id=3595285</discussion>
<link>http://www.whosay.com/jimgaffigan/content/218011</link>
</item>

f0
Show HN: Bootswatch, free swatches for your Bootstrap site

f14
Building the next Lego: we're hiring iOS developers & web developers (YC S11)
2 hours ago
14 points
B正确
7 comments
3594944
http://news.ycombinator.com/item?id=3594944
http://launchpadtoys.com/blog/2012/02/iosdeveloper-webdeveloper/
f15
SOPA foe Fred Wilson supports a blacklist of pirate sites

To handle this you have to work in chunks; there seems to be a dummy spacer element that can help you:

$news = preg_split('/<tr style="height:5px"><\/tr>/',$html->find('tbody',2)->innertext);

Then handle the other elements within each chunk, using the same selectors.
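A minimal sketch of the per-chunk idea, under stated assumptions: it uses PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`) instead of Simple HTML DOM so that it runs without the library, the sample markup is hypothetical and only resembles the front page, and `firstText()` is a helper invented for this illustration. Because each chunk is queried on its own, a missing score or username becomes `null` for that one item instead of shifting the values of every later item.

```php
<?php
// Hypothetical sample resembling the front-page rows: one story, one job post.
$rowsHtml = '<tr><td class="title"><a href="http://example.com/a">Story A</a></td></tr>'
    . '<tr><td class="subtext"><span>196 points</span> by <a href="user?id=alice">alice</a>'
    . ' 3 hours ago | <a href="item?id=1">30 comments</a></td></tr>'
    . '<tr style="height:5px"></tr>'
    . '<tr><td class="title"><a href="http://example.com/job">Job post</a></td></tr>'
    . '<tr><td class="subtext">2 hours ago</td></tr>'
    . '<tr style="height:5px"></tr>';

// Hypothetical helper: trimmed text of the first node matching $query, or null.
function firstText(DOMXPath $xp, string $query): ?string
{
    $nodes = $xp->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

libxml_use_internal_errors(true); // keep parser warnings out of the output

$items = [];
foreach (explode('<tr style="height:5px"></tr>', $rowsHtml) as $chunk) {
    if (trim($chunk) === '') {
        continue; // empty piece after the last spacer
    }
    $doc = new DOMDocument();
    // Wrap the rows in a table so the parser keeps the tr/td structure.
    $doc->loadHTML('<table>' . $chunk . '</table>');
    $xp = new DOMXPath($doc);
    $items[] = [
        'title'    => firstText($xp, '//td[@class="title"]/a'),
        'score'    => firstText($xp, '//td[@class="subtext"]/span'),
        'user'     => firstText($xp, '//td[@class="subtext"]/a[starts-with(@href, "user")]'),
        'comments' => firstText($xp, '//td[@class="subtext"]/a[starts-with(@href, "item?id")]'),
    ];
}

print_r($items);
```

With this layout the caller can decide per item whether to skip it or to emit it with empty fields.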

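Thanks to Ivan's idea, I now split the initially scraped HTML into an array in which each element represents one post. Then, looping over the posts, I check whether each one contains the upvote arrow image. If it does not, I don't add it to the result. Finally, everything is stitched back together and the sponsored posts are left out. Here is the code: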

$array = explode('<tr style="height:5px"></tr>',$html);
$clean = '';
foreach ($array as $post) {

    // Keep only posts that contain the upvote arrow; sponsored posts lack it
    if (strpos($post,'grayarrow.gif') !== false) {

    $clean .=  $post;

    }

}
unset($array);
$html = str_get_html($clean.'</body></html>');

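They also have, you know…

@Radu They do, but I'm trying to get: the time of posting, the number of comments, the poster's username, and the submission's score.

Thanks a lot for the suggestion to split the page into sections. I've updated the code above with my solution.

Hi, you should leave the question as it was so that people can understand what your initial approach was... That way they can see what your problem was.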
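// $news is the array produced by the preg_split() call above
foreach($news as $article){
    $article = str_get_html($article);
    // Empty chunk, or no upvote arrow found, so it's not a valid article
    if($article === false || count($article->find('img')) === 0){
        continue;
    }
}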