Python: robust DOM parsing with getElementsByTagName
The following code (taken from "Dive Into Python") fails with this traceback:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/htmlToNumEmbedded.py", line 2, in <module>
    xmldoc = minidom.parse('/path/to/index.html')
  File "/usr/lib/python2.7/xml/dom/minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 924, in parse
    result = builder.parseFile(fp)
  File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: mismatched tag: line 12, column 4
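
The failure can be reproduced without the original file. The sketch below assumes Python 3 and uses a made-up snippet (`broken_html` and `wellformed_xml` are hypothetical strings, not the asker's index.html). minidom relies on the expat XML parser, which only accepts well-formed XML, so ordinary HTML with an unclosed `<img>` tag is rejected:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical input: acceptable HTML, but not well-formed XML (<img> is never closed).
broken_html = "<html><body><p><img src='a.png'></p></body></html>"

try:
    minidom.parseString(broken_html)
    error_message = None
except ExpatError as exc:
    # expat reports the first place where the closing tag does not match.
    error_message = str(exc)

print(error_message)

# The same markup with a self-closing <img/> is well-formed XML and parses fine.
wellformed_xml = "<html><body><p><img src='a.png'/></p></body></html>"
images = minidom.parseString(wellformed_xml).getElementsByTagName("img")
print(len(images))
```

So the traceback most likely points at the input rather than at getElementsByTagName itself: the document has to be parsed by something that tolerates real-world HTML.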
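A workaround is possible, but it seems clunky: is there a built-in function I have overlooked, or another, more elegant way to do robust DOM parsing with getElementsByTagName?

You can use BeautifulSoup for this: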
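from bs4 import BeautifulSoup

# Name the parser explicitly; 'html.parser' ships with Python and tolerates malformed HTML.
with open('/path/to/index.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

# find_all returns a list of every <img> tag in the document.
soup.find_all("img")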
With lxml, if you need a list of elements rather than an iterator over them, call list on the return value of the element's iter method:
from lxml import html

# iter('img') yields the <img> elements lazily; list() collects them.
reflist = list(html.parse('/path/to/index.html').iter('img'))
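
If installing a third-party package is not an option, the standard library's `html.parser.HTMLParser` is one built-in that copes with markup like this. The following is only a minimal sketch, assuming Python 3; the `ImgCollector` class and the `sample_html` string are hypothetical names introduced for illustration. It does not build a DOM, it simply records the attributes of each `<img>` start tag as the parser encounters it:

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of (name, value) pairs.
        # Self-closing tags such as <img ... /> are routed here by default as well.
        if tag == "img":
            self.images.append(dict(attrs))


# Made-up input with an unclosed <img> and an uppercase, self-closing one.
sample_html = "<html><body><p><img src='a.png'><IMG SRC='b.png' /></p></body></html>"

collector = ImgCollector()
collector.feed(sample_html)
collector.close()

srcs = [img.get("src") for img in collector.images]
print(srcs)
```

This is more code than the BeautifulSoup or lxml one-liners, which is roughly the trade-off the question describes: no extra dependency, but the element collection has to be written by hand.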