How can I identify HTML text together with its corresponding heading-like object?

Tags: html, python-3.x, beautifulsoup

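Below is a sample HTML document, but my use case involves various kinds of unstructured text. What is a general way to tie (tag) each of the two text paragraphs below to its parent heading (SUMMARY1)? The heading here is not a real heading tag, just bold text. I am trying to extract and identify text paragraphs together with their corresponding heading section, regardless of whether the heading is a standard heading tag or something like the following: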

<!doctype html>
    <html lang="en">
        <head>
            <meta charset="utf-8">

            <title>Europe Test  - Some stats</title>
            <meta name="description" content="Watch videos and find the latest information.">
<body>
<p>
    <b><location value="LS/us.de" idsrc="xmltag.org">SUMMARY1</b>
    </p>
    <p>
      This is a region in <location>Europe</location>
      where the climate is good.
    </p>
    <p>
      Total <location>Europe</location> population estimate was used back then.
    </p>

<div class="aspNetHidden"></div>
        </body>
    </html>


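I initially considered using a module built for article extraction, but could not find a way to get SUMMARY1 as a distinct part of the "summary", the "description", or any other part of the resulting Article object. In any case, that kind of module is worth checking out; it may really help you parse HTML articles.

However, with BeautifulSoup you can locate the heading first and then get the p elements that follow it. It should look something like this:

from bs4 import BeautifulSoup

# Replace with the document to parse, e.g. the sample HTML from the question.
html = 'your html'
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly

header = soup.find('b')  # the bold text that acts as the heading
print(header.text)
first_paragraph = header.find_next('p')  # find_next is the current name of findNext
print(first_paragraph.text)
second_paragraph = first_paragraph.find_next('p')
print(second_paragraph.text)
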
You can also do it this way:

from bs4 import BeautifulSoup

content = """
<html>
    <div>
        <p>
            <b><location value="LS/us.de" idsrc="xmltag.org">SUMMARY1</b>
        </p>
        <p>
            This is a region in <location>Europe</location>
            where the climate is good.
        </p>
        <p>
            Total <location value="LS/us.de" idsrc="xmltag.org">Europe</location> population estimate was used back then.
        </p>
    </div>
</html>
"""
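soup = BeautifulSoup(content, "lxml")  # requires the lxml package
# The first bold element acts as the heading.
heading = soup.select_one("b")
# Its parent <p> is followed by sibling <p> elements that hold the body text;
# collapse the whitespace in each one and join them into a single string.
paragraphs = ' '.join(
    ' '.join(data.text.split())
    for data in heading.find_parent().find_next_siblings()
)
print({heading.text: paragraphs})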
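
The snippet above handles a single bold heading and joins everything into one string. As a rough sketch of how this might be generalized, the example below keeps each paragraph separate and accepts either a standard heading tag or a paragraph whose entire text is bold. The names `is_heading`, `group_by_heading`, and `SAMPLE`, as well as the "whole paragraph is bold" heuristic, are assumptions made for illustration, not something taken from the answers.

```python
from bs4 import BeautifulSoup

HEADING_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6"]

# Hypothetical sample: one bold pseudo-heading and one real heading tag.
SAMPLE = """
<p><b>SUMMARY1</b></p>
<p>This is a region in <location>Europe</location>
where the climate is good.</p>
<p>Total <location>Europe</location> population estimate was used back then.</p>
<h2>SUMMARY2</h2>
<p>Another paragraph.</p>
"""


def is_heading(tag):
    """Assumed heuristic: a real heading tag, or a block whose whole text is bold."""
    if tag.name in HEADING_TAGS:
        return True
    bold = tag.find(["b", "strong"])
    text = tag.get_text(strip=True)
    return bold is not None and text != "" and bold.get_text(strip=True) == text


def group_by_heading(html):
    """Map each heading's text to the list of paragraphs that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    current = None
    # find_all returns matches in document order.
    for tag in soup.find_all(HEADING_TAGS + ["p"]):
        text = " ".join(tag.get_text().split())  # collapse whitespace
        if not text:
            continue
        if is_heading(tag):
            current = text
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(text)
    return sections


sections = group_by_heading(SAMPLE)
print(sections)
```

Paragraphs that appear before any heading are skipped in this sketch; real documents would probably need further heuristics (font size, capitalization, CSS classes) to decide what counts as a heading.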

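Comments:

Can you share a URL for this kind of article? I might have a ready-made idea :)

Thanks for the quick reply. Unfortunately I don't have a URL; my starting point is this kind of HTML text stored in files.

No problem. Can you provide the complete HTML source of such an article? I think you should try to include it in the question itself or, if the content length limit does not allow that, temporarily use a third-party service such as pastebin. Thanks.

@alecxe, I have updated the HTML in the original question, thanks!

Is it possible to generalize this so that all paragraphs and their corresponding parents are identified, regardless of whether the parent is a bold heading, a standard heading tag, or something else? Also, what about a reverse lookup? That is, given any line of a paragraph, determine its parent (heading)? Thanks
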
{'SUMMARY1': [
     'This is a region in Europe where the climate is good.', 
     'Total Europe population estimate was used back then.']}
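
For the reverse lookup raised in the comments (given a line of text, find its heading), one possible sketch is to locate the paragraph containing the text and then walk backwards through the document with `find_all_previous` until something that looks like a heading turns up. The helper names `looks_like_heading` and `find_heading_for`, the sample markup, and the bold-paragraph heuristic are all assumptions for illustration.

```python
from bs4 import BeautifulSoup

HEADING_NAMES = ["h1", "h2", "h3", "h4", "h5", "h6"]

# Hypothetical sample with a bold pseudo-heading and a real heading tag.
DOC = """
<p><b>SUMMARY1</b></p>
<p>This is a region in <location>Europe</location> where the climate is good.</p>
<p>Total <location>Europe</location> population estimate was used back then.</p>
<h2>SUMMARY2</h2>
<p>Another paragraph.</p>
"""


def looks_like_heading(tag):
    """Assumed heuristic: a heading tag, or a <p> whose whole text is bold."""
    if tag.name in HEADING_NAMES:
        return True
    if tag.name != "p":
        return False
    bold = tag.find(["b", "strong"])
    text = tag.get_text(strip=True)
    return bold is not None and bool(text) and bold.get_text(strip=True) == text


def find_heading_for(soup, fragment):
    """Return the heading above the first body paragraph containing `fragment`."""
    for para in soup.find_all("p"):
        text = " ".join(para.get_text().split())
        if fragment in text and not looks_like_heading(para):
            # find_all_previous yields the nearest preceding elements first.
            for prev in para.find_all_previous(HEADING_NAMES + ["p"]):
                if looks_like_heading(prev):
                    return " ".join(prev.get_text().split())
            return None  # paragraph found, but no heading precedes it
    return None  # no paragraph contains the fragment


soup = BeautifulSoup(DOC, "html.parser")
print(find_heading_for(soup, "population estimate"))
print(find_heading_for(soup, "Another"))
```

This matches on a plain substring of the whitespace-normalized paragraph text, so a line that occurs under several headings resolves to the first one only.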
The second snippet (the one using find_next_siblings) prints:

{'SUMMARY1': 'This is a region in Europe where the climate is good. Total Europe population estimate was used back then.'}