Python: Why does HTMLParser miss some tags?

I am using HTMLParser to count the h2 tags in a web page.

The code is as follows:

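from HTMLParser import HTMLParser  # Python 2 standard library module
import urllib2

class City2Parser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Called once for every start tag the parser recognizes.
        if tag == 'h2':
            print 'h2'

req = urllib2.Request('http://www.worldgolf.com/courses/usa/massachusetts/')
html = urllib2.urlopen(req)
parser = City2Parser()
parser.feed(html.read())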

It prints h2 only once. Why? The page clearly has three h2 tags.

Take a look at what happens:

>>> from HTMLParser import HTMLParser
>>> import urllib2
>>> class City2Parser(HTMLParser): 
...     def handle_starttag(self,tag,attrs): 
...         if tag == 'h2': 
...             print 'h2'
... 
>>> req = urllib2.Request('http://www.worldgolf.com/courses/usa/massachusetts/') 
>>> html = urllib2.urlopen(req) 
>>> parser = City2Parser() 
>>> parser.feed(html.read())
h2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/HTMLParser.py", line 109, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 151, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 232, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 307, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/usr/lib/python2.7/HTMLParser.py", line 116, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 249, column 30

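It is complaining about invalid HTML: the parser raises HTMLParseError at the malformed start tag on line 249, so none of the markup after that point is ever parsed.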
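The position reported in the error message can be used to look at the offending markup. The following is a minimal sketch, written for Python 3 and assuming the page source is already held in a string. `show_context` and the sample text are hypothetical names introduced only for illustration, and the line and column are treated as 1-based, as they appear in the message.

```python
def show_context(text, line, column, width=20):
    """Return the text around a 1-based (line, column) position."""
    lines = text.splitlines()
    target = lines[line - 1]            # convert the 1-based line to an index
    start = max(column - 1 - width, 0)  # clamp at the start of the line
    return target[start:column - 1 + width]


# Hypothetical sample standing in for the downloaded page source.
sample = "<html>\n<body>\n<a href='x'title='y'>link</a>\n</body>\n</html>"
print(show_context(sample, 3, 1))
```

For the page in the question, the call would be `show_context(page_source, 249, 30)`, where `page_source` is assumed to be the decoded response body.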

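You would have to implement a set of handlers in City2Parser to cope with the messy markup and JavaScript that HTMLParser apparently cannot handle out of the box. Why not use something like BeautifulSoup instead: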
from BeautifulSoup import BeautifulSoup
import urllib2

page = urllib2.urlopen('http://www.worldgolf.com/courses/usa/massachusetts/')
soup = BeautifulSoup(page)
s = soup.findAll('h2')

print len(s)
for t in s:
    print t.text

This gives:
3
Featured Massachusetts Golf Course
Golf Locations
Latest user ratings for Massachusetts golf courses

Unless, of course, the whole point is to use HTMLParser.
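If staying with the standard library is the goal, one option is a newer interpreter: since Python 3.5 the `html.parser` module no longer has a strict mode or an `HTMLParseError` exception, so it keeps going on markup it does not like. The following is a minimal sketch under that assumption, not the asker's code; `H2Counter` and the inline sample are hypothetical and stand in for the real page.

```python
from html.parser import HTMLParser


class H2Counter(HTMLParser):
    """Count <h2> start tags and collect the text inside each one."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.titles = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.count += 1
            self._in_h2 = True
            self.titles.append("")  # start a new title buffer

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.titles[-1] += data  # text may arrive in several chunks


# Hypothetical stand-in for the downloaded page.
sample = """
<h2>Featured</h2>
<p>text</p>
<h2>Locations</h2>
<h2>Ratings</h2>
"""

parser = H2Counter()
parser.feed(sample)
parser.close()
print(parser.count)
print(parser.titles)
```

Keeping a counter on the parser instead of printing inside the handler makes the result easy to check after `feed()` returns.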
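No exception is raised on my system.
@Ace: If not, then your setup is very odd. It certainly should be raised.
I just updated my HTMLParser.py. I tried the previous version and got the error…
@Ace: "Updated my HTMLParser.py"? Aren't you talking about the standard library module?
I mean.. I updated it to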