Python: I can't get the items in a list from a webpage using BeautifulSoup
I've recently been reading the BeautifulSoup documentation to get familiar with parsing websites, so this is all new to me. Can anyone tell me why I can't get the latest headlines from this site (http://www.news-record.com)? Here is the code:
import urllib2
import BeautifulSoup
page = urllib2.urlopen("http://www.news-record.com/")
soup = BeautifulSoup.BeautifulSoup(page)
headlines = []
headline = soup.find('a', ({"class" : "nrcTxt_headline"}))
while headline:
    url = headline.findParent('div')['id']
    headlines.append([url, headline.string])
    headline = headline.findNext('span', {'class' : "nrcTxt_headline"})
print soup.headline
Here is the relevant part of the site:
<div class="nrcNav_menu">
<ul>
<li class="nrcTxt_menu1 nrcBlk_comboModTab nrc_default nrc_active">
<a href="#nrcMod_FP_Breaking">
Latest Headlines
</a>
</li>
<li class="nrcTxt_menu2 nrcBlk_comboModTab nrc_itemLast">
<a href="#nrcMod_FP_MostRead">
Most Read
</a>
</li>
</ul>
</div>
<div id="nrcMod_FP_Breaking" class="nrcBlk_comboModPage nrc_default nrc_active">
<h4 class="nrcTxt_modHed">
<span class="nrcTxt_label">
Latest Headlines
</span>
</h4>
<ul class="nrcBlk_artList">
<li class="nrcBlk_artHedOnly nrcBlk_art nrcBlk_art4">
<a class="nrcTxt_headline" href="/content/2012/02/28/article/city_hosts_earth_day_recycling_contest">
City hosts Earth Day recycling contest
</a>
<span class="nrcBlk_pubdate">
<span class="nrc_sep">
(
</span>
<!-- COLLAPSE WHITESPACE
-->
<span class="nrc_val">
3:53 pm
</span>
<!-- COLLAPSE WHITESPACE
-->
<span class="nrc_sep">
)
</span>
</span>
</li>
<li class="nrcBlk_artHedOnly nrcBlk_art nrcBlk_art5">
<a class="nrcTxt_headline" href="/content/2012/02/28/article/got_bull_rockingham_deputies_seek_900_pound_beast">
Got bull? Rockingham deputies seek 900-pound beast
</a>
If I remove the parentheses from the code, I get an error:
import urllib2
import BeautifulSoup
site = "http://www.news-record.com/"
page = urllib2.urlopen("http://www.news-record.com/")
soup = BeautifulSoup.BeautifulSoup(page)
headlines = []
for headline in soup.findAll('a', {"class" : "nrcTxt_headline"}):
    url = headline.findParent('div', {"class":"nrcNav_menu"})
    print ([headline.string])
Try the following code:
import urllib2
import BeautifulSoup
site = "http://www.news-record.com/"
page = urllib2.urlopen("http://www.news-record.com/")
soup = BeautifulSoup.BeautifulSoup(page)
headlines = []
headlineList = soup.findAll('a', {"class" : "nrcTxt_headline"})
for headline in headlineList:
    headlines.append(site+str(headline.get('href')))
print headlines
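If you're on Python 3 (where urllib2 and the old BeautifulSoup 3 module no longer exist), the same extraction can be sketched with nothing but the standard library. This is only an illustration of the logic, not a drop-in replacement; the class names and hrefs are copied from the markup quoted in the question:

```python
from html.parser import HTMLParser

# A cut-down copy of the markup quoted in the question.
SNIPPET = """
<ul class="nrcBlk_artList">
  <li><a class="nrcTxt_headline"
         href="/content/2012/02/28/article/city_hosts_earth_day_recycling_contest">
        City hosts Earth Day recycling contest</a></li>
  <li><a class="nrcTxt_headline"
         href="/content/2012/02/28/article/got_bull_rockingham_deputies_seek_900_pound_beast">
        Got bull? Rockingham deputies seek 900-pound beast</a></li>
</ul>
"""

class HeadlineParser(HTMLParser):
    """Collects the href of every <a class="nrcTxt_headline"> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "nrcTxt_headline":
            self.links.append(attrs.get("href"))

site = "http://www.news-record.com"
parser = HeadlineParser()
parser.feed(SNIPPET)
headlines = [site + link for link in parser.links]
print(headlines)
```

In a real run you would feed the parser the downloaded page body instead of SNIPPET, e.g. `urllib.request.urlopen(site).read().decode()`.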
You are wrapping the find arguments in a tuple, and not using the generator where you could. You have
soup.find('a', ({"class" : "nrcTxt_headline"}))
when you want
soup.find('a', {"class" : "nrcTxt_headline"})
For example:
for headline in soup.find('a', {"class" : "nrcTxt_headline"}):
    url = headline.findParent('div')['id']
    headlines.append([url, headline.string])
It gives the same result as my code. I'm trying to get the 10 headlines (hrefs) under "Latest Headlines" on the page. If you just want the links... I edited the code above; it should return a list of links. I'd also suggest you use the lxml parser: it doesn't rely on regexps and runs faster than BeautifulSoup.
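The behaviour the comments describe (iterating over `soup.find(...)` appearing to "work") comes from `find` returning a single tag: looping over that tag walks its children, not the list of matches. A small sketch, assuming the modern bs4 package is installed (where findAll is spelled find_all):

```python
from bs4 import BeautifulSoup  # assumes bs4 is available; BS3 syntax differs slightly

html = ('<div><a class="nrcTxt_headline" href="/a">First</a>'
        '<a class="nrcTxt_headline" href="/b">Second</a></div>')
soup = BeautifulSoup(html, "html.parser")

# find() returns only the FIRST matching tag...
first = soup.find("a", {"class": "nrcTxt_headline"})
print(first["href"])

# ...so looping over it iterates that one tag's children (its text),
# not the set of matching tags:
print(list(first))

# find_all() is what returns every match:
links = [a["href"] for a in soup.find_all("a", {"class": "nrcTxt_headline"})]
print(links)
```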