Python web scraper and parent name problem
I am trying to retrieve the dates inside div class="ipo-cell-height" together with the company names, for example 2/21/2014 and SUNDANCE ENERGY AUSTRALIA LTD. Here is the link to the website and here is the HTML. This code is contained in the second div class="genTable thin floatL" style="width:315px".
python, web-scraping, beautifulsoup, python-3.3
You can build a list of divs based on the CSS class. This uses requests and BeautifulSoup 3:
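import requests
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2 only)

req = requests.get('http://nasdaq.com/markets/ipos')
soup = BeautifulSoup(req.content)
# Take the first matching container, then collect its ipo-cell-height cells.
ipo_divs = soup.findAll('div', {'class':'genTable thin floatL'})[0]
c = ipo_divs.findAll('div', {'class':'ipo-cell-height'})
# Map each even-indexed cell to the cell that follows it (xrange is Python 2).
ipos = {c[i].text:c[i + 1].text for i in xrange(0, len(c) - 1, 2)}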
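The dictionary comprehension above pairs every even-indexed cell with the cell after it. A minimal sketch of that pairing step on its own, in Python 3 and without any network access, might look like the following. The `cells` list is a hypothetical stand-in for the text of the scraped divs, and it assumes the cells alternate date, company. One thing the sketch makes visible: if the keys are dates, a dict keeps only one company per date, so a list of pairs may be the safer container.

```python
# Hypothetical sample data: cell texts in page order,
# assumed to alternate date, company name.
cells = [
    "2/21/2014", "SUNDANCE ENERGY AUSTRALIA LTD",
    "2/14/2014", "INOGEN INC",
    "2/14/2014", "SEMLER SCIENTIFIC, INC.",
]

# Pair even-indexed items (dates) with odd-indexed items (companies).
pairs = list(zip(cells[0::2], cells[1::2]))

# A dict keyed by date keeps only the last company seen for each date.
by_date = dict(pairs)

for date, company in pairs:
    print("{} - {}".format(date, company))
```

Here `pairs` holds three entries while `by_date` holds two, because both 2/14/2014 rows share a key.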
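One approach is to iterate over all div elements whose class attribute has the value ipo-cell-height, use a regular expression to check whether the text of each one matches a date, then find the next div element and print the text of both: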
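from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Download the page and parse it with the standard-library parser.
html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html, "html.parser")

for div in soup.find_all('div', attrs={'class': 'ipo-cell-height'}):
    s = div.string
    # Skip cells with no single string; keep only M/D/YYYY dates.
    if s and re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
        # The company name is in the next div.
        div_next = div.find_next('div')
        print('{} - {}'.format(s, div_next.string))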
Run it like this:
python3 script.py
This yields:
2/21/2014 - SUNDANCE ENERGY AUSTRALIA LTD
2/14/2014 - INOGEN INC
2/14/2014 - SEMLER SCIENTIFIC, INC.
10/9/2013 - SFX ENTERTAINMENT, INC
2/13/2014 - IIM GLOBAL CORP
2/12/2014 - Q2 HOLDINGS, INC.
2/12/2014 - RIMINI STREET, INC.
2/12/2014 - MARY FEED & SUPPLIES, INC.
2/11/2014 - 21ST CENTURY ONCOLOGY HOLDINGS, INC.
2/3/2014 - GRASSMERE ACQUISITION CORP
1/31/2014 - APTALIS HOLDINGS INC.
1/27/2014 - UNITED STATES CURRENCY FUNDS TRUST
1/22/2014 - CHRYSLER GROUP LLC
1/10/2014 - GCT SEMICONDUCTOR INC
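To see which cell texts the date check lets through, the regular expression from the script can be tried on a few strings in isolation. This is only a sketch: `is_date` and `samples` are hypothetical names introduced for illustration, and the None guard reflects that `div.string` can be None, for example when a tag has several children.

```python
import re

# Same pattern as in the script: M/D/YYYY with a one- or two-digit
# month and day and a four-digit year.
DATE_RE = re.compile(r'\d{1,2}/\d{1,2}/\d{4}$')

def is_date(text):
    """Hypothetical helper: True if text looks like a date such as 2/21/2014."""
    # Guard against None or empty strings before matching.
    return bool(text) and DATE_RE.match(text) is not None

samples = ["2/21/2014", "10/9/2013", "INOGEN INC", "2/21/14", None]
results = [is_date(s) for s in samples]
print(results)
```

Only the first two samples pass: company names do not start with digits, and a two-digit year fails the `\d{4}` part.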