Python Beautiful Soup: accessing <li> elements from a <ul> without an id
Tags: python, html-parsing, web-scraping, beautifulsoup

I am trying to scrape the people with birthdays from this Wikipedia page. Here is the current code:
# Python 2 (urllib2 and the print statement)
import urllib2
from bs4 import BeautifulSoup

hdr = {'User-Agent': 'Mozilla/5.0'}
site = "http://en.wikipedia.org/wiki/" + "january" + "_" + "1"
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
print soup
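This all works fine and I get the whole HTML page, but I need specific data, and I don't know how to access it with Beautiful Soup without an id. The <ul> tag has no id, and the <li> tags have none either. I also can't just ask for every <li> tag, because there are other lists on the page. Is there a specific way to get at a given list? (I can't use a fix that only works for this page, because I plan to iterate over all dates and get the birthdays from each page, and I can't guarantee that every page is laid out exactly like this one.)

Find the Births section: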
section = soup.find('span', id='Births').parent
Then find the next unordered list:
births = section.find_next('ul').find_all('li')
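The idea is to get the span with the Births id, take its parent, find the next <ul> after it (here, the parent's next sibling), and iterate over its <li> elements. Here is a complete example using requests (though that is beside the point):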
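A minimal sketch of such an example, assuming the requests and bs4 packages are installed and that the page wraps <span id="Births"> in a heading that is followed by a <ul>. The names birth_entries, fetch, print_births and the SAMPLE_HTML snippet are illustrative and not taken from the original answer.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the markup the answer relies on:
# a heading wrapping <span id="Births">, followed by a <ul>.
SAMPLE_HTML = """
<h2><span id="Events">Events</span></h2>
<ul><li>(placeholder event entry)</li></ul>
<h2><span id="Births">Births</span></h2>
<ul>
  <li>1431 – Pope Alexander VI (d. 1503)</li>
  <li>1449 – Lorenzo de' Medici, Italian politician (d. 1492)</li>
</ul>
"""


def birth_entries(html):
    """Return the text of each <li> in the first <ul> after the Births heading."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", id="Births")
    if span is None:
        return []  # no Births section on this page
    section = span.parent  # the heading element that wraps the span
    births = section.find_next("ul")  # first <ul> after the heading
    if births is None:
        return []
    return [li.get_text().strip() for li in births.find_all("li")]


def fetch(url):
    """Download a page, sending the same User-Agent header as the question."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()
    return response.text


def print_births(url="http://en.wikipedia.org/wiki/january_1"):
    """Fetch the page and print one birth entry per line (needs network access)."""
    for entry in birth_entries(fetch(url)):
        print(entry)
```

Calling birth_entries(SAMPLE_HTML) exercises the parsing step offline, while print_births() fetches the live page; what comes back from the live page depends on its current markup.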
Prints:
871 – Zwentibold, Frankish son of Arnulf of Carinthia (d. 900)
1431 – Pope Alexander VI (d. 1503)
1449 – Lorenzo de' Medici, Italian politician (d. 1492)
1467 – Sigismund I the Old, Polish king (d. 1548)
1484 – Huldrych Zwingli, Swiss pastor and theologian (d. 1531)
1511 – Henry, Duke of Cornwall (d. 1511)
1516 – Margaret Leijonhufvud, Swedish wife of Gustav I of Sweden (d. 1551)
...
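Hope that helps.

Comments:

You need some kind of reference, whether it is position, id, class, etc. For example, do you know which list on the page it is, by number? Is that consistent?

Start from the Births section on the page, which is clearly identified by its id; each <li> corresponds to one person (until the next heading). This recipe works for most summary pages on Wikipedia; just change the id value.

Thanks.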
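The two points from the comments (change the id to pick a section, and read entries until the next heading) can be sketched as below. This is an extension of the answer's approach, not the answer's own code: section_entries, day_page_titles and SAMPLE_HTML are hypothetical names, and the sketch assumes each section heading wraps a <span> carrying the section id and that lists are siblings of that heading.

```python
import calendar
from bs4 import BeautifulSoup

# Hypothetical markup: a section whose entries are split across two lists.
SAMPLE_HTML = """
<h2><span id="Births">Births</span></h2>
<ul><li>1431 – Pope Alexander VI (d. 1503)</li></ul>
<h3>Later</h3>
<ul><li>1516 – Margaret Leijonhufvud, Swedish wife of Gustav I of Sweden (d. 1551)</li></ul>
<h2><span id="Deaths">Deaths</span></h2>
<ul><li>(placeholder entry)</li></ul>
"""


def section_entries(html, section_id):
    """Collect <li> texts from every <ul> between a heading and the next one."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", id=section_id)
    if span is None:
        return []
    heading = span.parent
    entries = []
    # find_next_siblings(True) yields only tags, skipping whitespace text nodes.
    for sibling in heading.find_next_siblings(True):
        if sibling.name == heading.name:
            break  # a heading of the same level starts the next section
        if sibling.name == "ul":
            entries.extend(li.get_text().strip() for li in sibling.find_all("li"))
    return entries


def day_page_titles(year=2024):
    """Yield titles such as 'January_1' for each day; 2024 includes February_29."""
    for month in range(1, 13):
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            # month_name is English unless the process locale has been changed
            yield "{}_{}".format(calendar.month_name[month], day)
```

Each title from day_page_titles() can be appended to "http://en.wikipedia.org/wiki/" and the downloaded HTML passed to section_entries(html, "Births"). Returning an empty list when the id is missing lets the loop skip pages whose layout differs, which was the concern raised in the question.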