What is the best way in Python to handle bad links given to BeautifulSoup?
I'm working on pulling URLs out of delicious and then using those URLs to discover associated feeds. However, some of the bookmarks in delicious are not HTML links and cause BS to barf. Basically, I want to throw away a link if, when BS fetches it, it doesn't look like HTML. Right now, this is what I get:
trillian:Documents jauderho$ ./d2o.py "green data center"
processing http://www.greenm3.com/
processing http://www.eweek.com/c/a/Green-IT/How-to-Create-an-EnergyEfficient-Green-Data-Center/?kc=rss
Traceback (most recent call last):
File "./d2o.py", line 53, in <module>
get_feed_links(d_links)
File "./d2o.py", line 43, in get_feed_links
soup = BeautifulSoup(html)
File "/Library/Python/2.5/site-packages/BeautifulSoup.py", line 1499, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/Library/Python/2.5/site-packages/BeautifulSoup.py", line 1230, in __init__
self._feed(isHTML=isHTML)
File "/Library/Python/2.5/site-packages/BeautifulSoup.py", line 1263, in _feed
self.builder.feed(markup)
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 108, in feed
self.goahead(0)
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 150, in goahead
k = self.parse_endtag(i)
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 314, in parse_endtag
self.error("bad end tag: %r" % (rawdata[i:j],))
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: u'</b />', at line 739, column 1
I simply wrap my BeautifulSoup processing and look for the HTMLParser.HTMLParseError exception:
import HTMLParser
import BeautifulSoup

try:
    soup = BeautifulSoup.BeautifulSoup(raw_html)
    for a in soup.findAll('a'):
        href = a['href']
        # ...
except HTMLParser.HTMLParseError:
    print "failed to parse", url
Beyond that, though, you can check the Content-Type of the response while crawling and make sure it is something like text/html or application/xhtml+xml (or similar) before you even try to parse the page. That should head off most of the errors.
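The Content-Type check described above can be sketched as follows. This is a minimal illustration in modern Python (the question itself is from the Python 2.5 era); `looks_like_html` and `fetch_if_html` are hypothetical helper names, not part of any library:

```python
import urllib.request

# MIME types we are willing to hand to an HTML parser.
HTML_TYPES = ("text/html", "application/xhtml+xml", "application/xml")

def looks_like_html(content_type):
    """Return True if a Content-Type header value names an HTML-ish type."""
    if not content_type:
        return False
    # Strip parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";")[0].strip().lower()
    return mime in HTML_TYPES

def fetch_if_html(url):
    """Fetch url and return its body only when the server says it is HTML.

    Returns None for non-HTML responses, so the caller can skip them
    instead of feeding them to BeautifulSoup.
    """
    response = urllib.request.urlopen(url)
    if looks_like_html(response.headers.get("Content-Type")):
        return response.read()
    return None
```

Skipping non-HTML responses up front avoids most parser exceptions, though a try/except around the parse is still a sensible last line of defense for pages that lie about their type.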