Python: I can't get the items in a list from a webpage using BeautifulSoup
I've recently been reading the BeautifulSoup documentation to get familiar with parsing websites, so this is all new to me. Can anyone tell me why I can't get the latest headlines from this site (http://www.news-record.com)? Here is the code:
import urllib2
import BeautifulSoup
page = urllib2.urlopen("http://www.news-record.com/")
soup = BeautifulSoup.BeautifulSoup(page)
headlines = []
headline = soup.find('a', ({"class" : "nrcTxt_headline"}))
while headline:
    url = headline.findParent('div')['id']
    headlines.append([url, headline.string])
    headline = headline.findNext('span', {'class' : "nrcTxt_headline"})
print soup.headline
Here is the relevant part of the site:
<div class="nrcNav_menu">
<ul>
<li class="nrcTxt_menu1 nrcBlk_comboModTab nrc_default nrc_active">
<a href="#nrcMod_FP_Breaking">
Latest Headlines
</a>
</li>
<li class="nrcTxt_menu2 nrcBlk_comboModTab nrc_itemLast">
<a href="#nrcMod_FP_MostRead">
Most Read
</a>
</li>
</ul>
</div>
<div id="nrcMod_FP_Breaking" class="nrcBlk_comboModPage nrc_default nrc_active">
<h4 class="nrcTxt_modHed">
<span class="nrcTxt_label">
Latest Headlines
</span>
</h4>
<ul class="nrcBlk_artList">
<li class="nrcBlk_artHedOnly nrcBlk_art nrcBlk_art4">
<a class="nrcTxt_headline" href="/content/2012/02/28/article/city_hosts_earth_day_recycling_contest">
City hosts Earth Day recycling contest
</a>
<span class="nrcBlk_pubdate">
<span class="nrc_sep">
(
</span>
<!-- COLLAPSE WHITESPACE
-->
<span class="nrc_val">
3:53 pm
</span>
<!-- COLLAPSE WHITESPACE
-->
<span class="nrc_sep">
)
</span>
</span>
</li>
<li class="nrcBlk_artHedOnly nrcBlk_art nrcBlk_art5">
<a class="nrcTxt_headline" href="/content/2012/02/28/article/got_bull_rockingham_deputies_seek_900_pound_beast">
Got bull? Rockingham deputies seek 900-pound beast
</a>
If I remove the parentheses from the code, I get an error:
import urllib2
import BeautifulSoup
site = "http://www.news-record.com/"
page = urllib2.urlopen("http://www.news-record.com/")
soup = BeautifulSoup.BeautifulSoup(page)
headlines = []
for headline in soup.findAll('a', {"class" : "nrcTxt_headline"}):
    url = headline.findParent('div', {"class":"nrcNav_menu"})
    print ([headline.string])
Try the following code:
import urllib2
import BeautifulSoup
site = "http://www.news-record.com/"
page = urllib2.urlopen("http://www.news-record.com/")
soup = BeautifulSoup.BeautifulSoup(page)
headlines = []
headlineList = soup.findAll('a', {"class" : "nrcTxt_headline"})
for headline in headlineList:
    headlines.append(site+str(headline.get('href')))
print headlines
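If you're on Python 3 (where urllib2 and the old BeautifulSoup 3 module no longer exist), the same extraction can be sketched with nothing but the standard library. This is only an illustration of the logic, not a drop-in replacement; the class names and hrefs are copied from the markup quoted in the question:

```python
from html.parser import HTMLParser

# A cut-down copy of the markup quoted in the question.
SNIPPET = """
<ul class="nrcBlk_artList">
  <li><a class="nrcTxt_headline"
         href="/content/2012/02/28/article/city_hosts_earth_day_recycling_contest">
        City hosts Earth Day recycling contest</a></li>
  <li><a class="nrcTxt_headline"
         href="/content/2012/02/28/article/got_bull_rockingham_deputies_seek_900_pound_beast">
        Got bull? Rockingham deputies seek 900-pound beast</a></li>
</ul>
"""

class HeadlineParser(HTMLParser):
    """Collects the href of every <a class="nrcTxt_headline"> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "nrcTxt_headline":
            self.links.append(attrs.get("href"))

site = "http://www.news-record.com"
parser = HeadlineParser()
parser.feed(SNIPPET)
headlines = [site + link for link in parser.links]
print(headlines)
```

In a real run you would feed the parser the downloaded page body instead of SNIPPET, e.g. `urllib.request.urlopen(site).read().decode()`.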
You are wrapping the find arguments in a tuple, and not using the generator where you could. You have
soup.find('a', ({"class" : "nrcTxt_headline"}))
when you want
soup.find('a', {"class" : "nrcTxt_headline"})
For example:
for headline in soup.find('a', {"class" : "nrcTxt_headline"}):
    url = headline.findParent('div')['id']
    headlines.append([url, headline.string])
It gives the same result as my code. I'm trying to get the 10 headlines (hrefs) under "Latest Headlines" on the page. If you just want the links... I edited the code above; it should return a list of links. I'd also suggest you use the lxml parser: it doesn't rely on regexps and runs faster than BeautifulSoup.
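The behaviour the comments describe (iterating over `soup.find(...)` appearing to "work") comes from `find` returning a single tag: looping over that tag walks its children, not the list of matches. A small sketch, assuming the modern bs4 package is installed (where findAll is spelled find_all):

```python
from bs4 import BeautifulSoup  # assumes bs4 is available; BS3 syntax differs slightly

html = ('<div><a class="nrcTxt_headline" href="/a">First</a>'
        '<a class="nrcTxt_headline" href="/b">Second</a></div>')
soup = BeautifulSoup(html, "html.parser")

# find() returns only the FIRST matching tag...
first = soup.find("a", {"class": "nrcTxt_headline"})
print(first["href"])

# ...so looping over it iterates that one tag's children (its text),
# not the set of matching tags:
print(list(first))

# find_all() is what returns every match:
links = [a["href"] for a in soup.find_all("a", {"class": "nrcTxt_headline"})]
print(links)
```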