Python: converting HTML to JSON with BeautifulSoup

I am trying to parse a specific HTML structure into JSON with BeautifulSoup, using the following code:

    from bs4 import BeautifulSoup
    
    html = """<h1>Heading</h1>
<h1>More heading</h1>
<p>test</p>
<h2>Section</h2>
<p>a.b.c</p>
<h3>Prio</h3>
<p>Medium</p>
<h3>Description</h3>
<p>Description 1</p>
<p>Description 2</p>
<h3>Foo</h3>
<p>Foo 1</p>
<p>Foo 2</p>
<h3>Bar</h3>
<p>Bar 1</p>
<p>Bar 2</p>
<p>Bar 3</p>
<h3>Baz</h3>
<p>Baz 1</p>
<h2>Section</h2>
<h3>Prio</h3>
<p>High</p>
<h3>Description</h3>
<p>Description 3</p>
<p>Description 4</p>
<h3>Foo</h3>
<p>Foo 3</p>
<h2>Section</h2>
<h3>Prio</h3>
<p>Low</p>
<h3>Description</h3>
<p>Description 5</p>
<p>Description 6</p>
<h3>Foo</h3>
<p>Foo 4</p>
<p>Foo 5</p>
<h3>Bar</h3>
<p>Bar 4</p>
<p>Bar 5</p>
<h3>Baz</h3>
<p>Baz 2</p>
<h2>Section</h2>
<h3>Prio</h3>
<p>Medium</p>
<h3>Description</h3>
<p>Description 7</p>
<h3>Foo</h3>
<p>Foo 6</p>
<h3>Bar</h3>
<p>Bar 6</p>
<h3>Baz</h3>
<p>Baz 3</p>"""
    
    json = {}
    data = []
    soup = BeautifulSoup(html, 'lxml')
    json['Category'] = soup.find('h1').string
        
    for section in soup.find_all('h2'):
        p = ''
        content = {}
        for sibling in section.next_siblings:
            if sibling.name == 'h3':
                prev_section = sibling.find_previous_sibling('h3')
                if prev_section:
                    if not prev_section.text == 'Baz' and not prev_section.text == 'Bar':
                        content[prev_section.text] = p
                p = ''
            if sibling.name == 'p':
                if not p:
                    p = sibling.text
                else:
                    p = p + '\n' + sibling.text
            elif sibling.name == 'h2':
                data.append(content)
                content = {}
                p = ''
    json['Data'] = data
    print(json)
However, the code above gives me this:

{'Category': 'Heading', 'Data': [{'Prio': 'Medium', 'Description': 'Description 1\nDescription 2', 'Foo': 'Foo 1\nFoo 2'}, {'Prio': 'High', 'Description': 'Description 3\nDescription 4'}, {'Foo': 'Foo 4\nFoo 5', 'Prio': 'Low', 'Description': 'Description 5\nDescription 6'}, {'Prio': 'High', 'Description': 'Description 3\nDescription 4'}, {'Foo': 'Foo 4\nFoo 5', 'Prio': 'Low', 'Description': 'Description 5\nDescription 6'}, {'Foo': 'Foo 4\nFoo 5', 'Prio': 'Low', 'Description': 'Description 5\nDescription 6'}]}
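So, basically, the idea is to take the first h2 and parse everything after it, splitting the content by h3 value, until the next h2 is reached. I don't really know how to do this in BeautifulSoup.
Any pointers in the right direction would be much appreciated.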

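You need a way to keep a count of the sections (via `h2`). You need a loop that handles one section at a time (by filtering on siblings, plus an `if tag.name == 'h2'` check to avoid overrunning into the next section, because the selector cannot reference siblings relative to the current element at that point), and a way to track whether you are on an `h3` and need to add a new key and value to the current dictionary. If the key already exists, append the paragraph to that key's value. You also need to filter out the `Bar` and `Baz` blocks (via `:not` and `:contains`) and the paragraph that directly follows an `h2` (`h2 + p`).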
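The overrun mentioned above is easy to reproduce in isolation. The following is a minimal sketch on a made-up four-element snippet (the HTML and the variable names are assumptions for illustration, not part of the question): `next_siblings` keeps going to the end of the parent, so each section has to be cut off at the next `h2` explicitly.

```python
from bs4 import BeautifulSoup

# Hypothetical two-section snippet; no whitespace between tags,
# so every sibling is a Tag rather than a text node.
html = "<div><h2>A</h2><p>a1</p><h2>B</h2><p>b1</p></div>"
soup = BeautifulSoup(html, "html.parser")
first_h2 = soup.find("h2")

# Without a stop condition, the siblings of the first h2 include
# the second h2 and everything that belongs to it.
names = [sib.name for sib in first_h2.next_siblings]
print(names)

# Stopping at the next h2 keeps only the first section's content.
section = []
for sib in first_h2.next_siblings:
    if sib.name == "h2":
        break
    section.append(sib.get_text())
print(section)
```

The full code below restricts each section with a CSS selector and keeps the `h2` check as a safeguard.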

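    from bs4 import BeautifulSoup as bs

    html = """your html"""  # placeholder: paste the HTML from the question here
    soup = bs(html, 'lxml')
    total_sections = len(soup.select('h2'))
    result = {}
    result['Category'] = soup.select_one('h1').text
    data = []

    for i in range(1, total_sections + 1):
        temp = {}

        # Siblings after the i-th h2, excluding: the paragraph directly after
        # an h2, anything containing "Bar" or "Baz", and the next h2 together
        # with everything that follows it.
        # Note: newer soupsieve releases prefer :-soup-contains() over :contains().
        for j in soup.select(f'h2:nth-of-type({i}) ~ *:not(h2 + p, :contains("Bar"), :contains("Baz"), h2:nth-of-type({i + 1}), h2:nth-of-type({i + 1}) ~ *)'):
            if j.name == 'h2':
                break

            if j.name == 'h3':
                flag = j.text
                temp[flag] = ''  # start an empty value for this heading
            else:
                temp[flag] += ' ' + j.text
        # Drop the leading space left by the concatenation above.
        data.append({key: value.strip() for key, value in temp.items()})
    result['Data'] = data
    print(result)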
In fact, an `elif` is not really needed for the second branch; a plain `else` is enough:

    else:
        temp[flag] += ' ' + j.text
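
For comparison, the same grouping can be done in a single pass without CSS selectors. This is only a sketch and not the approach from the answer above: the function name `sections_to_dict`, the `skip` parameter, the shortened sample HTML, and the choice to join paragraphs with a newline (as in the question's attempted output) are all assumptions for illustration.

```python
from bs4 import BeautifulSoup


def sections_to_dict(html, skip=("Bar", "Baz")):
    """Group <p> text under each <h3>, one dict per <h2> (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    result = {"Category": soup.find("h1").get_text(strip=True), "Data": []}
    current = None  # dict for the section being filled
    key = None      # h3 heading currently collecting paragraphs

    # find_all returns the matching tags in document order.
    for tag in soup.find_all(["h2", "h3", "p"]):
        if tag.name == "h2":
            current = {}
            result["Data"].append(current)
            key = None  # ignore paragraphs until the next h3
        elif current is None:
            continue  # content before the first h2
        elif tag.name == "h3":
            heading = tag.get_text(strip=True)
            key = None if heading in skip else heading
            if key is not None:
                current.setdefault(key, [])
        elif key is not None:
            current[key].append(tag.get_text(strip=True))

    # Join the collected paragraphs (assumption: newline-separated).
    result["Data"] = [
        {k: "\n".join(v) for k, v in section.items()}
        for section in result["Data"]
    ]
    return result


# Shortened, hypothetical version of the question's HTML.
sample = """<h1>Heading</h1><h1>More heading</h1><p>test</p>
<h2>Section</h2><p>a.b.c</p>
<h3>Prio</h3><p>Medium</p>
<h3>Foo</h3><p>Foo 1</p><p>Foo 2</p>
<h3>Bar</h3><p>Bar 1</p>
<h2>Section</h2>
<h3>Prio</h3><p>High</p>
<h3>Baz</h3><p>Baz 1</p>"""

parsed = sections_to_dict(sample)
print(parsed)
```

Because the loop resets `key` at every `h2` and at every skipped `h3`, the paragraph right after a section heading and the `Bar`/`Baz` paragraphs are dropped without any extra filtering.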