Python BeautifulSoup4: excluding a div inside a wrapper


I am trying to get all of the text from the following HTML structure:

<div class="header"><h1>Header</h1></div>
<div class="container">
    <div class="header"><h1>Sub Header</h1></div>
    <p>Target_2</p>
    <p>Target_3</p>
    <p>Target_4</p>
</div>
Output:

Header
Sub Header
Target_2
Target_3
Target_4
Sub Header
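My problem is that "Sub Header" is found twice because of the header class. How can I exclude the header class when it appears inside the container class?
I have to grab everything by class.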

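You can add a condition inside the loop that skips a tag when it sits inside another tag with class="container", using .find_parent() (see the snippet below that calls tag.find_parent).

Alternatively, you can set the recursive parameter to False, so that find_all() only looks at the direct children of the object it is called on: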

from bs4 import BeautifulSoup


html = """
<div class="header"><h1>Header</h1></div>
<div class="container">
    <div class="header"><h1>Sub Header</h1></div>
    <p>Target_2</p>
    <p>Target_3</p>
    <p>Target_4</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
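# recursive=False: only direct children of soup are checked, so the nested header is never matched on its own
targets = soup.find_all("div", class_=["header", "container"], recursive=False)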

for tag in targets:
    print(tag.text.strip())
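
To make the effect of recursive=False concrete, here is a minimal sketch that counts the matches with and without it. It also illustrates one caveat: the trick only works when the target divs are direct children of the object you call find_all() on. The wrapped document is a hypothetical variant of the question's HTML, added purely for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="header"><h1>Header</h1></div>
<div class="container">
    <div class="header"><h1>Sub Header</h1></div>
    <p>Target_2</p>
</div>"""

classes = ["header", "container"]
soup = BeautifulSoup(html, "html.parser")

# Default (recursive=True): every descendant is searched, so the nested header matches too.
deep = soup.find_all("div", class_=classes)
# recursive=False: only the direct children of soup are searched.
shallow = soup.find_all("div", class_=classes, recursive=False)
print(len(deep), len(shallow))

# Hypothetical variant: the same markup wrapped in <html><body>.
# Now the only direct child of the soup object is <html>, so nothing matches...
wrapped = BeautifulSoup("<html><body>" + html + "</body></html>", "html.parser")
from_root = wrapped.find_all("div", class_=classes, recursive=False)
# ...and the search has to start from the element that actually holds the divs.
from_body = wrapped.body.find_all("div", class_=classes, recursive=False)
print(len(from_root), len(from_body))
```

If the divs you care about live deeper in the tree, the .find_parent() check is likely the more robust choice, since it does not depend on where the search starts.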
Using .find_parent():

from bs4 import BeautifulSoup


html_doc = '''<div class="header"><h1>Header</h1></div>
<div class="container">
    <div class="header"><h1>Sub Header</h1></div>
    <p>Target_2</p>
    <p>Target_3</p>
    <p>Target_4</p>
</div>'''

soup = BeautifulSoup(html_doc, 'html.parser')
targets = soup.find_all("div", class_=["header", "container"])

for tag in targets:
    # Skip any match nested inside a container; its text is already part of the container's text
    if tag.find_parent(attrs={'class':'container'}):
        continue
    print(tag.text.strip())
Output:

Header
Sub Header
Target_2
Target_3
Target_4
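
The .find_parent() check can also be wrapped in a small helper. The sketch below is one possible generalization, not part of the original answers: the function name outermost_by_class is hypothetical, and it assumes that any match nested inside another match (not only inside a container) should be dropped. It collects the text through .stripped_strings so that each piece of text comes back as its own clean list entry.

```python
from bs4 import BeautifulSoup

html_doc = '''<div class="header"><h1>Header</h1></div>
<div class="container">
    <div class="header"><h1>Sub Header</h1></div>
    <p>Target_2</p>
    <p>Target_3</p>
    <p>Target_4</p>
</div>'''


def outermost_by_class(soup, classes):
    """Return divs matching any of `classes` that have no matching ancestor div.

    Hypothetical helper: the name and the "outermost match wins" rule are
    assumptions made for this sketch.
    """
    matches = soup.find_all("div", class_=classes)
    return [tag for tag in matches
            if tag.find_parent("div", class_=classes) is None]


soup = BeautifulSoup(html_doc, "html.parser")
blocks = outermost_by_class(soup, ["header", "container"])

lines = []
for tag in blocks:
    # stripped_strings yields each text node with surrounding whitespace removed
    lines.extend(tag.stripped_strings)

for line in lines:
    print(line)
```

Because the nested header is filtered out before any text is read, "Sub Header" is collected only once, as part of the container.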