How do I associate HTML text with its corresponding heading-like element?

html python-3.x beautifulsoup

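Below is an HTML example, but my use case involves different kinds of unstructured text. What is a general way to bind (tag) each of the two text paragraphs below to its parent heading (SUMMARY1)? The heading here is not a real heading tag, just bold text. I am trying to extract text paragraphs and identify the heading section each one belongs to, regardless of whether the heading is a standard heading tag or something like the following: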

<!doctype html>
<html lang="en">
    <head>
        <meta charset="utf-8">

        <title>Europe Test  - Some stats</title>
        <meta name="description" content="Watch videos and find the latest information.">
    </head>
    <body>
        <p>
            <b><location>SUMMARY1</location></b>
        </p>
        <p>
            This is a region in <location>Europe</location>
            where the climate is good.
        </p>
        <p>
            Total <location>Europe</location> population estimate was used back then.
        </p>

        <div class="aspNetHidden"></div>
    </body>
</html>



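With BeautifulSoup, it should look something like this:
from bs4 import BeautifulSoup

html = 'your html'  # replace with the HTML document from the question
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
header = soup.find('b')  # the bold text that acts as the heading
print(header.text)
# find_next is the current name of the older findNext alias
first_paragraph = header.find_next('p')
print(first_paragraph.text)
second_paragraph = first_paragraph.find_next('p')
print(second_paragraph.text)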

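I initially considered using a different module, but could not find a way to get SUMMARY1 as the sole part of the "summary" or "description", or as any other part of the resulting Article object. In any case, check that module out; it may really help you parse HTML articles.

If you use BeautifulSoup, however, you can first locate the heading and then step to the p elements that follow it, as in the snippet above.

You can also do it this way: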
from bs4 import BeautifulSoup

content = """
<html>
    <div>
        <p>
            <b><location value="LS/us.de" idsrc="xmltag.org">SUMMARY1</b>
        </p>
        <p>
            This is a region in <location>Europe</location>
            where the climate is good.
        </p>
        <p>
            Total <location value="LS/us.de" idsrc="xmltag.org">Europe</location> population estimate was used back then.
        </p>
    </div>
</html>
"""
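soup = BeautifulSoup(content, "lxml")
# The first <b> element acts as the heading.
heading = soup.select("b")[0]
# Join the whitespace-normalized text of every sibling that follows the heading's parent <p>.
paragraphs = ' '.join([' '.join(data.text.split()) for data in heading.find_parent().find_next_siblings()])
print({heading.text : paragraphs})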

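Can you share a URL for this kind of article? I might have a ready-made idea :)

Thanks for the quick reply. Unfortunately I don't have a URL; my starting point is this kind of HTML text stored in files.

No problem, can you provide the complete HTML source of such an article? I think you should try to include it in the question itself or, if the content length limit does not allow that, temporarily use a third-party service such as pastebin. Thanks.

@alecxe, I updated the HTML in the original question, thanks! Can this be generalized so that all paragraphs and their corresponding parents are identified, regardless of whether the parent is a bold heading, a standard heading tag, or something else? Also, what about a reverse lookup? That is, given any line of a paragraph, determine its parent (heading)? Thanks.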
Keeping each paragraph as a separate list entry (that is, dropping the outer ' '.join) gives:
{'SUMMARY1': [
     'This is a region in Europe where the climate is good.',
     'Total Europe population estimate was used back then.']}
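
A minimal sketch of how the list-valued result above can be produced. It assumes a simplified copy of the sample markup and uses Python's built-in "html.parser" so that nothing beyond bs4 is required; the names heading and result are illustrative only.

```python
from bs4 import BeautifulSoup

content = """
<div>
    <p><b>SUMMARY1</b></p>
    <p>
        This is a region in <location>Europe</location>
        where the climate is good.
    </p>
    <p>
        Total <location>Europe</location> population estimate was used back then.
    </p>
</div>
"""

# Assumption: "html.parser" (standard library) instead of the "lxml" parser.
soup = BeautifulSoup(content, "html.parser")
heading = soup.select("b")[0]

# One whitespace-normalized string per sibling tag that follows the heading's <p>.
paragraphs = [
    " ".join(sibling.get_text().split())
    for sibling in heading.find_parent().find_next_siblings()
]
result = {heading.get_text(): paragraphs}
print(result)
```

Keeping the paragraphs separate makes it easier to work with individual paragraphs later, for example when searching for the paragraph that contains a given line.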

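The snippet as posted, with the outer join, merges all paragraphs under the heading into one string:
{'SUMMARY1': 'This is a region in Europe where the climate is good. Total Europe population estimate was used back then.'}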
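
The generalization and the reverse lookup asked about in the comments could be approached along the following lines. This is only a sketch under stated assumptions, not the method of either answer. The helper names is_heading, group_by_heading and heading_for are hypothetical, the SUMMARY2 section is made-up sample data, and the heading heuristic (an h1 to h6 tag, or a block whose entire text is bold) is an assumption that will misclassify body paragraphs that are set fully in bold.

```python
from bs4 import BeautifulSoup

HEADING_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6"]
BOLD_TAGS = ["b", "strong"]


def normalize(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def is_heading(tag):
    """Heuristic (an assumption): <h1>-<h6>, or a block whose whole text is bold."""
    if tag.name in HEADING_TAGS:
        return True
    bold = tag.find(BOLD_TAGS)
    text = normalize(tag.get_text())
    return bold is not None and text != "" and normalize(bold.get_text()) == text


def group_by_heading(soup):
    """Map each heading's text to the list of paragraphs that follow it."""
    sections = {}
    current = None  # text of the most recent heading seen so far
    for block in soup.find_all(HEADING_TAGS + ["p"]):
        text = normalize(block.get_text())
        if not text:
            continue
        if is_heading(block):
            current = text
            sections.setdefault(current, [])
        elif current is not None:
            # Paragraphs that appear before any heading are skipped.
            sections[current].append(text)
    return sections


def heading_for(sections, fragment):
    """Reverse lookup: heading of the first paragraph containing the fragment."""
    fragment = normalize(fragment)
    for heading, paragraphs in sections.items():
        if any(fragment in paragraph for paragraph in paragraphs):
            return heading
    return None


# Sample markup; the SUMMARY2 section is hypothetical and only there for illustration.
content = """
<div>
  <p><b>SUMMARY1</b></p>
  <p>This is a region in <location>Europe</location>
     where the climate is good.</p>
  <p>Total <location>Europe</location> population estimate was used back then.</p>
  <h2>SUMMARY2</h2>
  <p>A paragraph under a standard heading tag.</p>
</div>
"""

soup = BeautifulSoup(content, "html.parser")
sections = group_by_heading(soup)
print(sections)
print(heading_for(sections, "population estimate"))
```

Walking the blocks in document order, rather than relying on sibling relationships, means the heading and its paragraphs do not need to share a parent element. The reverse lookup is a plain substring search over the grouped paragraphs, so it returns the first match when the same line occurs under several headings.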