Python: cutting/slicing an HTML document into pieces with BeautifulSoup?

I have an HTML document that looks like this:

<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2> 
<p>Html I do not want...</p>

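I don't want a list of the h2 tags. I want to cut the document at the second h2 tag and keep everything above it in a new variable. Basically, the output I want is: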

<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>

What is the best way to "slice"/"cut" an HTML document like this, rather than simply finding tags and outputting the tags themselves?

You can locate the "References" element, then remove every sibling that follows it as well as the element itself:

import re
from bs4 import BeautifulSoup

data = """
<div>
    <h1> Name of Article </h1>
    <p>First Paragraph I want</p>
    <p>More Html I'm interested in</p>
    <h2> Subheading in the article I also want </h2>
    <p>Even more Html i want to pull out of the document.</p>
    <h2> References </h2>
    <p>Html I do not want...</p>
</div>
"""
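soup = BeautifulSoup(data, "lxml")

# Locate the "References" heading (string= is the current name of the older text= argument)
references = soup.find("h2", string=re.compile("References"))
# Remove every tag that follows the heading at the same level...
for elm in references.find_next_siblings():
    elm.extract()
# ...then remove the heading itself
references.extract()

print(soup)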

Prints:

<div>
    <h1> Name of Article</h1>
    <p>First Paragraph I want</p>
    <p>More Html I'm interested in</p>
    <h2> Subheading in the article I also want </h2>
    <p>Even more Html i want to pull out of the document.</p>
</div>
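
Since the question asks for the kept part "in a new variable", here is a minimal sketch of a non-destructive variant that leaves the parsed tree untouched and builds a new string from the tags that precede the heading. The helper name `html_before` is hypothetical, and the built-in "html.parser" is used only so the example has no lxml dependency. It assumes the wanted content consists of sibling tags of the heading; bare text between tags is not copied.

```python
import re
from bs4 import BeautifulSoup

data = """<div>
<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2>
<p>Html I do not want...</p>
</div>"""


def html_before(soup, heading_pattern):
    """Return the markup of all sibling tags preceding the matching <h2> (hypothetical helper)."""
    heading = soup.find("h2", string=re.compile(heading_pattern))
    if heading is None:
        # No such heading: nothing to cut, return the whole document
        return str(soup)
    # find_previous_siblings() returns the nearest tag first,
    # so reverse it to restore document order
    tags = reversed(heading.find_previous_siblings())
    return "\n".join(str(tag) for tag in tags)


soup = BeautifulSoup(data, "html.parser")
kept = html_before(soup, "References")
print(kept)
```

Unlike the `extract()` approach, `soup` still contains the "References" section afterwards, so the same tree can be sliced again at a different heading.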


You can find the position of the last h2 in the raw string and then take the substring up to it:

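# `html` is the raw HTML string and `soup` its parsed form; str(tag) must match the raw markup for rfind() to work
last_h2_tag = str(soup.find_all("h2")[-1])
# drop "+ len(last_h2_tag)" to exclude the last <h2> itself from the result
html[:html.rfind(last_h2_tag) + len(last_h2_tag)]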
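
The string-slicing idea can also be sketched without depending on `str(tag)` matching the raw markup, which it may not do once a parser has normalized the document. The following is an illustration under stated assumptions, not the answer's own method: the helper name `slice_before_heading` is hypothetical, it searches the raw string with a regular expression, and it cuts before the first matching heading rather than the last one. Regular expressions on HTML are fragile, so this only suits simple, predictable markup like the sample.

```python
import re

html = """<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2>
<p>Html I do not want...</p>
"""


def slice_before_heading(html, heading_text):
    """Return the part of `html` preceding the first <h2> whose text is `heading_text`."""
    pattern = re.compile(
        r"<h2[^>]*>\s*" + re.escape(heading_text) + r"\s*</h2>",
        re.IGNORECASE,
    )
    match = pattern.search(html)
    # No such heading: return the document unchanged
    return html if match is None else html[: match.start()]


kept = slice_before_heading(html, "References")
print(kept)
```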

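I think this should work, thanks!!

Thanks a lot. Very clean. Just to make sure I understand it correctly: the loop that calls elm.extract() is needed to remove all the HTML under the h2 tag, right? And the final references.extract() then just removes the "References" h2 tag once everything has been pulled out from under it?

@EazyC I hope this helps. In the loop we remove the next siblings of the References element, and then we remove the References element itself.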
