Python: Converting HTML to JSON with BeautifulSoup


I am trying to parse a specific HTML structure into JSON with BeautifulSoup, using the following code:

    from bs4 import BeautifulSoup
    
    html = """<h1>Heading</h1>
<h1>More heading</h1>
<p>test</p>
<h2>Section</h2>
<p>a.b.c</p>
<h3>Prio</h3>
<p>Medium</p>
<h3>Description</h3>
<p>Description 1</p>
<p>Description 2</p>
<h3>Foo</h3>
<p>Foo 1</p>
<p>Foo 2</p>
<h3>Bar</h3>
<p>Bar 1</p>
<p>Bar 2</p>
<p>Bar 3</p>
<h3>Baz</h3>
<p>Baz 1</p>
<h2>Section</h2>
<h3>Prio</h3>
<p>High</p>
<h3>Description</h3>
<p>Description 3</p>
<p>Description 4</p>
<h3>Foo</h3>
<p>Foo 3</p>
<h2>Section</h2>
<h3>Prio</h3>
<p>Low</p>
<h3>Description</h3>
<p>Description 5</p>
<p>Description 6</p>
<h3>Foo</h3>
<p>Foo 4</p>
<p>Foo 5</p>
<h3>Bar</h3>
<p>Bar 4</p>
<p>Bar 5</p>
<h3>Baz</h3>
<p>Baz 2</p>
<h2>Section</h2>
<h3>Prio</h3>
<p>Medium</p>
<h3>Description</h3>
<p>Description 7</p>
<h3>Foo</h3>
<p>Foo 6</p>
<h3>Bar</h3>
<p>Bar 6</p>
<h3>Baz</h3>
<p>Baz 3</p>"""
    
    json = {}
    data = []
    soup = BeautifulSoup(html, 'lxml')
    json['Category'] = soup.find('h1').string
        
    for section in soup.find_all('h2'):
        p = ''
        content = {}
        for sibling in section.next_siblings:
            if sibling.name == 'h3':
                prev_section = sibling.find_previous_sibling('h3')
                if prev_section:
                    if not prev_section.text == 'Baz' and not prev_section.text == 'Bar':
                        content[prev_section.text] = p
                p = ''
            if sibling.name == 'p':
                if not p:
                    p = sibling.text
                else:
                    p = p + '\n' + sibling.text
            elif sibling.name == 'h2':
                data.append(content)
                content = {}
                p = ''
    json['Data'] = data
    print(json)

However, the code above gives me this:

{'Category': 'Heading', 'Data': [{'Prio': 'Medium', 'Description': 'Description 1\nDescription 2', 'Foo': 'Foo 1\nFoo 2'}, {'Prio': 'High', 'Description': 'Description 3\nDescription 4'}, {'Foo': 'Foo 4\nFoo 5', 'Prio': 'Low', 'Description': 'Description 5\nDescription 6'}, {'Prio': 'High', 'Description': 'Description 3\nDescription 4'}, {'Foo': 'Foo 4\nFoo 5', 'Prio': 'Low', 'Description': 'Description 5\nDescription 6'}, {'Foo': 'Foo 4\nFoo 5', 'Prio': 'Low', 'Description': 'Description 5\nDescription 6'}]}

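So basically, the idea is to take the first h2, parse everything that follows it, grouping the content by h3 heading, until the next h2 is reached. I really don't know how to do this in BeautifulSoup.

Any pointers in the right direction would be greatly appreciated.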

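You need a way to keep count of the sections (via `h2`). You need a loop that processes one section at a time (by filtering siblings, with an `if tag.name == 'h2'` check to stop the loop, since the sibling cannot be referenced from the element itself at that point), and a way to track which `h3` you are under, so you know which key and value to add to the current dictionary. If the key already exists, the paragraph has to be appended to that key's value. You also need to filter out `Bar` and `Baz` (via `:not` and `:contains`) as well as `h2 + p`.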

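from bs4 import BeautifulSoup as bs

html = """your html"""
soup = bs(html, 'lxml')
total_sections = len(soup.select('h2'))
result = {}
result['Category'] = soup.select_one('h1').text
data = []

for i in range(1, total_sections + 1):
    temp = {}

    # Siblings after the i-th h2, minus: the p directly after an h2, anything
    # containing "Bar" or "Baz", and the next h2 with everything after it.
    # (Recent soupsieve releases deprecate :contains in favour of :-soup-contains.)
    for j in soup.select(f'h2:nth-of-type({i}) ~ *:not(h2 + p, :contains("Bar"), :contains("Baz"), h2:nth-of-type({i + 1}), h2:nth-of-type({i + 1}) ~ *)'):
        if j.name == 'h2':
            break

        if j.name == 'h3':
            flag = j.text  # start a new key for this heading
            temp[flag] = ''
        else:
            temp[flag] += ' ' + j.text  # append the paragraph to the current key
    # Strip the leading space left by the concatenation above.
    data.append({k: v.strip() for k, v in temp.items()})
result['Data'] = data
print(result)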
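
For comparison, the sibling-walking loop from the question can also be made to work. The sketch below is one possible fix, not the answer's method: it stops at the next `h2` instead of running to the end of the document, and it appends every section, including the last one. The names `parse_sections`, `SKIP` and `sample` are made up for illustration, and `html.parser` is used in place of `lxml` only so that no extra parser is required.

```python
from bs4 import BeautifulSoup, Tag

SKIP = {'Bar', 'Baz'}  # headings the question wants left out (illustrative name)


def parse_sections(html):
    """Group each h2 section's paragraphs under the preceding h3 heading."""
    soup = BeautifulSoup(html, 'html.parser')
    result = {'Category': soup.find('h1').get_text(), 'Data': []}
    for section in soup.find_all('h2'):
        content = {}
        key = None  # text of the most recent h3; None until one is seen
        for sibling in section.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip the whitespace strings between tags
            if sibling.name == 'h2':
                break  # next section reached: stop walking
            if sibling.name == 'h3':
                key = sibling.get_text()
                continue
            if sibling.name == 'p' and key is not None and key not in SKIP:
                content.setdefault(key, []).append(sibling.get_text())
        # Join the collected paragraphs; runs for the last section too.
        result['Data'].append({k: '\n'.join(v) for k, v in content.items()})
    return result


# Shortened version of the question's HTML, for illustration only.
sample = """<h1>Heading</h1>
<h2>Section</h2>
<p>a.b.c</p>
<h3>Prio</h3>
<p>Medium</p>
<h3>Foo</h3>
<p>Foo 1</p>
<p>Foo 2</p>
<h3>Bar</h3>
<p>Bar 1</p>
<h2>Section</h2>
<h3>Prio</h3>
<p>High</p>"""
print(parse_sections(sample))
```

Because paragraphs are only stored once an `h3` has been seen, the `<p>a.b.c</p>` that directly follows the `h2` is dropped without needing a separate filter.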

In fact, the paragraph branch of the answer's loop doesn't really need an `elif`; a plain `else` is enough:

else:
    temp[flag] += ' ' + j.text

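
The same bookkeeping (current section, current `h3` key, append if the key already exists) can also be done in a single pass over the tags in document order, without any CSS selectors. This is a sketch of an alternative, not the approach described above; `html_to_json`, its `skip` parameter and `doc` are hypothetical names, and `html.parser` is assumed instead of `lxml`. It also serialises the result with `json.dumps`, since the code above prints a Python dict rather than JSON text.

```python
import json

from bs4 import BeautifulSoup


def html_to_json(html, skip=('Bar', 'Baz')):
    """Single pass over the h2/h3/p tags in document order."""
    soup = BeautifulSoup(html, 'html.parser')
    data = []
    current = None  # dict of the section being filled; None before the first h2
    key = None      # current h3 text; None right after an h2
    for tag in soup.find_all(['h2', 'h3', 'p']):
        if tag.name == 'h2':
            current = {}
            data.append(current)
            key = None
        elif current is None:
            continue  # ignore anything that comes before the first h2
        elif tag.name == 'h3':
            key = tag.get_text()
        elif key is not None and key not in skip:
            # A paragraph: extend the existing value or start a new one.
            text = tag.get_text()
            current[key] = (current[key] + '\n' + text) if key in current else text
    result = {'Category': soup.find('h1').get_text(), 'Data': data}
    return json.dumps(result, indent=2)


# Illustrative input with no whitespace between the tags.
doc = ("<h1>Heading</h1><p>test</p><h2>Section</h2><p>a.b.c</p>"
       "<h3>Prio</h3><p>Low</p><h3>Baz</h3><p>Baz 2</p>"
       "<h2>Section</h2><h3>Description</h3>"
       "<p>Description 5</p><p>Description 6</p>")
print(html_to_json(doc))
```

Since this version never looks at `next_sibling`, it does not depend on there being newlines between the tags.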