How do I parse the next page with Beautiful Soup?
I use the following code to parse the link to the next page:
def parseNextThemeUrl(url):
    ret = []
    ret1 = []
    html = urllib.request.urlopen(url)
    html = BeautifulSoup(html, PARSER)
    html = html.find('a', class_='pager_next')
    if html:
        html = urljoin(url, html.get('href'))
        ret1 = parseNextThemeUrl(html)
        for r in ret1:
            ret.append(r)
    else:
        ret.append(url)
    return ret
But I get the error below. How can I parse the next link, if there is one?
Traceback (most recent call last):
html = urllib.request.urlopen(url)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 162, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 456, in open
req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
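The AttributeError means urlopen received a list instead of a URL string: urlopen accepts a str or a Request object, and anything else (here, most likely the list returned by a previous parseNextThemeUrl call being fed back in as url) fails when the opener tries to set req.timeout on it. A minimal reproduction (the example.com URL is a placeholder; nothing is actually fetched, since the failure happens before any connection is opened):

```python
import urllib.request

# Passing a list where a URL string is expected reproduces the traceback;
# no network request is made, because the failure happens before opening.
try:
    urllib.request.urlopen(['http://example.com/'])
except AttributeError as exc:
    print(exc)  # 'list' object has no attribute 'timeout'
```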
My answer is as follows:
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def parseNextThemeUrl(url):
    urls = []
    urls.append(url)
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, 'lxml')
    new_page = soup.find('a', class_='pager_next')
    if new_page:
        new_url = urljoin(url, new_page.get('href'))
        urls1 = parseNextThemeUrl(new_url)
        for url1 in urls1:
            urls.append(url1)
    return urls
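If recursion depth is a concern on sites with many pages, the same 'pager_next' walk can be written iteratively. This is only a sketch under the same assumptions as the answer above; the fetch parameter and the built-in html.parser are my substitutions, chosen so the function can be exercised without a network connection:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def collect_page_urls(first_url, fetch):
    """Follow 'pager_next' links iteratively and return every page URL.

    `fetch` is any callable mapping a URL to that page's HTML, e.g.
    lambda u: urllib.request.urlopen(u).read() for a live site.
    """
    urls = [first_url]
    current = first_url
    while True:
        soup = BeautifulSoup(fetch(current), 'html.parser')
        next_link = soup.find('a', class_='pager_next')
        if not next_link or not next_link.get('href'):
            break  # no further pager link on this page
        current = urljoin(current, next_link['href'])
        urls.append(current)
    return urls
```

With a stub fetch backed by a dict of canned pages, collect_page_urls returns each page URL exactly once, in order.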
Can you give us the link to the page? Without seeing the page itself, there is only so much we can tell.
http://003.b2btoys.net/en/ProductList.aspx?Class1=12
http://003.b2btoys.net/en/ProductList.aspx?PageIndex=2&Class1=13&Class2=0&type=&keyWord=