Python BeautifulSoup4: exclude a div nested inside a wrapper
I'm trying to get all the text from the following HTML structure:
<div class="header"><h1>Header</h1></div>
<div class="container">
<div class="header"><h1>Sub Header</h1></div>
<p>Target_2</p>
<p>Target_3</p>
<p>Target_4</p>
</div>
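The question's own code did not survive on this page. As a minimal sketch that reproduces the output listed next, assume the loop called find_all with both class names, as the answers further down do; the variable names here are illustrative only:

```python
from bs4 import BeautifulSoup

html = """<div class="header"><h1>Header</h1></div>
<div class="container">
<div class="header"><h1>Sub Header</h1></div>
<p>Target_2</p>
<p>Target_3</p>
<p>Target_4</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Assumed reconstruction: match both classes anywhere in the tree.
# The nested header div is matched on its own AND is part of the
# container's text, so "Sub Header" shows up twice.
targets = soup.find_all("div", class_=["header", "container"])

lines = []
for tag in targets:
    lines.extend(tag.text.strip().splitlines())

print("\n".join(lines))
```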
Output:
Header
Sub Header
Target_2
Target_3
Target_4
Sub Header
My problem is that "Sub Header" is found twice because of the header class. How can I exclude the header class when it is inside the container class?
I have to grab everything by class.

You can put a condition inside the loop that uses .find_parent() to skip any tag that sits inside another tag with class="container".
Alternatively, you can set the recursive parameter to False, which finds only direct children:
from bs4 import BeautifulSoup
html = """
<div class="header"><h1>Header</h1></div>
<div class="container">
<div class="header"><h1>Sub Header</h1></div>
<p>Target_2</p>
<p>Target_3</p>
<p>Target_4</p>
</div>"""
soup = BeautifulSoup(html, "html.parser")
targets = soup.find_all("div", class_=["header", "container"], recursive=False)
for tag in targets:
    print(tag.text.strip())
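One caveat with recursive=False is that it only looks at the direct children of the object it is called on. The sketch below uses a hypothetical variant of the sample, the same markup wrapped in html and body tags, to show that the search then has to start from the element that directly contains the divs:

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the sample: the same divs inside <html><body>.
html = """<html><body>
<div class="header"><h1>Header</h1></div>
<div class="container">
<div class="header"><h1>Sub Header</h1></div>
<p>Target_2</p>
</div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# The divs are no longer direct children of the soup object,
# so a non-recursive search from the soup finds nothing.
top_level = soup.find_all("div", class_=["header", "container"], recursive=False)

# They are direct children of <body>, so search from there instead.
from_body = soup.body.find_all("div", class_=["header", "container"], recursive=False)

texts = [tag.get_text(" ", strip=True) for tag in from_body]
print(len(top_level))
print(texts)
```

If the wrapper is not known in advance, the .find_parent() check is the more robust of the two approaches.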
The .find_parent() version:
from bs4 import BeautifulSoup
html_doc = '''<div class="header"><h1>Header</h1></div>
<div class="container">
<div class="header"><h1>Sub Header</h1></div>
<p>Target_2</p>
<p>Target_3</p>
<p>Target_4</p>
</div>'''
soup = BeautifulSoup(html_doc, 'html.parser')
targets = soup.find_all("div", class_=["header", "container"])
for tag in targets:
    # Skip any match that is nested inside a class="container" element.
    if tag.find_parent(attrs={'class': 'container'}):
        continue
    print(tag.text.strip())
Both versions print:
Header
Sub Header
Target_2
Target_3
Target_4
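The .find_parent() check hard-codes the container class. As a sketch of a more general variant, the helper below drops every matched tag that is nested inside another matched tag. The name outermost is hypothetical and not part of BeautifulSoup; it relies only on find_all, Tag.parents, and get_text:

```python
from bs4 import BeautifulSoup

html = """<div class="header"><h1>Header</h1></div>
<div class="container">
<div class="header"><h1>Sub Header</h1></div>
<p>Target_2</p>
</div>"""


def outermost(tags):
    """Keep only the tags that are not nested inside another tag in `tags`."""
    # Compare by identity so that two tags with equal markup stay distinct.
    matched = {id(tag) for tag in tags}
    return [
        tag for tag in tags
        if not any(id(parent) in matched for parent in tag.parents)
    ]


soup = BeautifulSoup(html, "html.parser")
targets = soup.find_all("div", class_=["header", "container"])

# The nested header is dropped because its parent (the container) also matched.
texts = [tag.get_text("\n", strip=True) for tag in outermost(targets)]
print("\n".join(texts))
```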