Following the HTML hierarchy in Beautiful Soup (html, parsing, beautifulsoup)


I have an HTML file in the following format:

<div class="entry"> 
<p>para1</p>
<p><a href="www.site.com">para2</a></p>
<p><div class="abc"> Ignore this part1</div> </p>
<p><script class="xyz">Ignore this part2 </script></p>
</div>



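Assume there is only one div tag with the class "entry". I want to print the text of every p tag inside that div, except for the p tags that are immediately followed by (that is, contain) a div or script tag. So here I want to print "para1" and "para2", but not "Ignore this part1" or "Ignore this part2".

How can I do this with Beautiful Soup?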

Use a lambda expression as the filter to exclude the tags you don't want.

For example:

from bs4 import BeautifulSoup


example = """<div class="entry"> 
<p>para1</p>
<p><a href="www.site.com">para2</a></p>
<p><div class="abc"> Ignore this part1</div> </p>
<p><script class="xyz">Ignore this part2 </script></p>
<p>example para</p>
</div>"""

soup = BeautifulSoup(example, 'html.parser')
entry = soup.find('div', class_="entry")
# Keep only the <p> tags that contain no <div> or <script> descendant
paragraphs = entry.find_all(
    lambda tag: tag.name == "p" and not (tag.find("div") or tag.find("script"))
)
for paragraph in paragraphs:
    print(paragraph.get_text(strip=True))

Output:

para1
para2
example para
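
The same filter can also be wrapped in a small function that only looks at the direct children of the "entry" div, which may be closer to "following the hierarchy". This is a minimal sketch rather than a definitive implementation: the function name `entry_paragraph_texts` and the `SAMPLE` constant are made up for illustration, and it assumes the `html.parser` backend, which keeps the invalid `<p><div>` nesting as written. Other parsers such as lxml or html5lib may restructure that nesting, so results could differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input, copied from the example above
SAMPLE = """<div class="entry">
<p>para1</p>
<p><a href="www.site.com">para2</a></p>
<p><div class="abc"> Ignore this part1</div> </p>
<p><script class="xyz">Ignore this part2 </script></p>
<p>example para</p>
</div>"""


def entry_paragraph_texts(html):
    """Return the text of each <p> directly under div.entry,
    skipping any <p> that contains a <div> or <script>."""
    # Assumption: html.parser, which leaves <div> nested inside <p>
    soup = BeautifulSoup(html, "html.parser")
    entry = soup.find("div", class_="entry")
    if entry is None:
        # No div with class "entry" in the document
        return []

    texts = []
    # recursive=False restricts the search to direct children of the div
    for p in entry.find_all("p", recursive=False):
        # find() accepts a list of tag names and returns None if none match
        if p.find(["div", "script"]) is None:
            texts.append(p.get_text(strip=True))
    return texts


result = entry_paragraph_texts(SAMPLE)
for text in result:
    print(text)
```

Returning a list instead of printing inside the loop makes the result easier to reuse, and the `None` check avoids an `AttributeError` when the page has no "entry" div.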