Iterating over the DOM with BeautifulSoup/Python
I have a DOM:
<h2>Main Section</h2>
<p>Bla bla bla</p>
<h3>Subsection</h3>
<p>Some more info</p>
<h3>Subsection 2</h3>
<p>Even more info!</p>
<h2>Main Section 2</h2>
<p>bla</p>
<h3>Subsection</h3>
<p>Some more info</p>
<h3>Subsection 2</h3>
<p>Even more info!</p>
I want to produce an iterator that returns 'Main Section', 'Bla bla bla', 'Subsection', and so on. Is there a way to do this with BeautifulSoup?

Note that I can't just do soup.find_all('h2') + soup.find_all('h3') + etc., because I want to preserve the order in which the tags appear in the DOM.

Here is one way to do it. The idea is to iterate over the main-section h2 tags and, for each h2 tag, iterate over its siblings until the next h2 tag:
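from bs4 import BeautifulSoup, Tag

data = """<h2>Main Section</h2>
<p>Bla bla bla</p>
<h3>Subsection</h3>
<p>Some more info</p>
<h3>Subsection 2</h3>
<p>Even more info!</p>
<h2>Main Section 2</h2>
<p>bla</p>
<h3>Subsection</h3>
<p>Some more info</p>
<h3>Subsection 2</h3>
<p>Even more info!</p>"""

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")

for main_section in soup.find_all("h2"):
    # Walk the nodes that follow this h2 at the same level, in document order.
    for sibling in main_section.next_siblings:
        # Skip text nodes, such as the newlines between tags.
        if not isinstance(sibling, Tag):
            continue
        # The next h2 starts a new main section, so stop here.
        if sibling.name == "h2":
            break
        print(sibling.text)
    print("-------")

This prints: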
Bla bla bla
Subsection
Some more info
Subsection 2
Even more info!
-------
bla
Subsection
Some more info
Subsection 2
Even more info!
-------
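
The loop above prints the content under each h2 but not the h2 headings themselves. If a single flat iterator over every heading and paragraph is what is wanted, one possible sketch is to pass a list of tag names to find_all, which returns matches in the order they appear in the document. The names iter_texts, names, and sample are made up for this illustration, and the sketch assumes the matching tags are not nested inside one another (a nested match would have its text yielded twice).

```python
from bs4 import BeautifulSoup


def iter_texts(html, names=("h2", "h3", "p")):
    """Yield the stripped text of each matching tag, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all with a list of names matches any of them in a single
    # traversal, so the DOM order is preserved.
    for tag in soup.find_all(list(names)):
        yield tag.get_text(strip=True)


# Shortened sample document (hypothetical, modeled on the question's DOM).
sample = """<h2>Main Section</h2>
<p>Bla bla bla</p>
<h3>Subsection</h3>
<p>Some more info</p>
<h2>Main Section 2</h2>
<p>bla</p>"""

texts = list(iter_texts(sample))
print(texts)
```

Because iter_texts is a generator, it can also be consumed lazily with a for loop instead of being collected into a list.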