Python: Converting HTML to JSON with BeautifulSoup
I am trying to parse a specific HTML structure into JSON with BeautifulSoup, using the following code:
from bs4 import BeautifulSoup
html = """<h1>Heading</h1>
<h1>More heading</h1>
<p>test</p>
<h2>Section</h2>
<p>a.b.c</p>
<h3>Prio</h3>
<p>Medium</p>
<h3>Description</h3>
<p>Description 1</p>
<p>Description 2</p>
<h3>Foo</h3>
<p>Foo 1</p>
<p>Foo 2</p>
<h3>Bar</h3>
<p>Bar 1</p>
<p>Bar 2</p>
<p>Bar 3</p>
<h3>Baz</h3>
<p>Baz 1</p>
<h2>Section</h2>
<h3>Prio</h3>
<p>High</p>
<h3>Description</h3>
<p>Description 3</p>
<p>Description 4</p>
<h3>Foo</h3>
<p>Foo 3</p>
<h2>Section</h2>
<h3>Prio</h3>
<p>Low</p>
<h3>Description</h3>
<p>Description 5</p>
<p>Description 6</p>
<h3>Foo</h3>
<p>Foo 4</p>
<p>Foo 5</p>
<h3>Bar</h3>
<p>Bar 4</p>
<p>Bar 5</p>
<h3>Baz</h3>
<p>Baz 2</p>
<h2>Section</h2>
<h3>Prio</h3>
<p>Medium</p>
<h3>Description</h3>
<p>Description 7</p>
<h3>Foo</h3>
<p>Foo 6</p>
<h3>Bar</h3>
<p>Bar 6</p>
<h3>Baz</h3>
<p>Baz 3</p>"""
json = {}
data = []
soup = BeautifulSoup(html, 'lxml')
json['Category'] = soup.find('h1').string
for section in soup.find_all('h2'):
    p = ''
    content = {}
    for sibling in section.next_siblings:
        if sibling.name == 'h3':
            prev_section = sibling.find_previous_sibling('h3')
            if prev_section:
                if not prev_section.text == 'Baz' and not prev_section.text == 'Bar':
                    content[prev_section.text] = p
            p = ''
        if sibling.name == 'p':
            if not p:
                p = sibling.text
            else:
                p = p + '\n' + sibling.text
        elif sibling.name == 'h2':
            data.append(content)
            content = {}
            p = ''
json['Data'] = data
print(json)
However, the code above gives me this:
{'Category': 'Heading', 'Data': [{'Prio': 'Medium', 'Description': 'Description 1\nDescription 2', 'Foo': 'Foo 1\nFoo 2'}, {'Prio': 'High', 'Description': 'Description 3\nDescription 4'}, {'Foo': 'Foo 4\nFoo 5', 'Prio': 'Low', 'Description': 'Description 5\nDescription 6'}, {'Prio': 'High', 'Description': 'Description 3\nDescription 4'}, {'Foo': 'Foo 4\nFoo 5', 'Prio': 'Low', 'Description': 'Description 5\nDescription 6'}, {'Foo': 'Foo 4\nFoo 5', 'Prio': 'Low', 'Description': 'Description 5\nDescription 6'}]}
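So basically, the idea is to take the first h2 and parse everything after it, splitting the content by h3 value, until the next h2 is found. I really don't know how to do this in BeautifulSoup.
Any pointers in the right direction would be greatly appreciated.

You need a way to keep a section count (via h2). You need a loop that processes one section at a time (via sibling filtering, plus an if tag.name == 'h2' check to break out of the loop, since the selector cannot refer back to the sibling it started from), and a way to track whether you are on an h3 and therefore need to add a new key and value to the current dictionary. If the key already exists, the paragraph has to be appended to that key's value. You also need to filter out Bar and Baz (via :not and :contains) and the paragraph that directly follows an h2 (h2 + p).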
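from bs4 import BeautifulSoup as bs

html = """your html"""
soup = bs(html, 'lxml')
total_sections = len(soup.select('h2'))
result = {}
result['Category'] = soup.select_one('h1').text
data = []

for i in range(1, total_sections + 1):
    temp = {}
    # Siblings after the i-th h2, minus: the p directly after an h2, anything
    # containing "Bar" or "Baz", and the (i + 1)-th h2 with everything after it.
    # Note: newer soupsieve releases deprecate :contains() in favour of :-soup-contains().
    for j in soup.select(f'h2:nth-of-type({i}) ~ *:not(h2 + p, :contains("Bar"), :contains("Baz"), h2:nth-of-type({i + 1}), h2:nth-of-type({i + 1}) ~ *)'):
        if j.name == 'h2':
            break
        if j.name == 'h3':
            flag = j.text      # each h3 starts a new key
            temp[flag] = ''    # was j.next_sibling.strip(), which is '' for this HTML
        else:
            # values start with a space; call .strip() on them if that is unwanted
            temp[flag] += ' ' + j.text
    data.append(temp)

result['Data'] = data
print(result)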
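To see what the individual pieces of that selector do, here is a small sketch on a toy document. The toy markup and the variable names are made up for illustration, and html.parser is used only to avoid the lxml dependency; it assumes a Beautiful Soup version whose select() is backed by soupsieve.

```python
from bs4 import BeautifulSoup

# Toy markup (hypothetical): two sections, the first with an intro paragraph.
toy = ("<div><h2>A</h2><p>intro</p><h3>Foo</h3><p>x</p>"
       "<h3>Bar</h3><p>Bar 1</p><h2>B</h2><h3>Foo</h3><p>y</p></div>")
soup = BeautifulSoup(toy, "html.parser")


def texts(selector):
    """Return the text of every element matched by a CSS selector."""
    return [tag.get_text() for tag in soup.select(selector)]


# '~' matches every later sibling of the first h2, so it runs into section B.
all_after = texts("h2:nth-of-type(1) ~ *")

# Excluding the second h2 and its later siblings keeps only section A.
first_only = texts(
    "h2:nth-of-type(1) ~ *:not(h2:nth-of-type(2), h2:nth-of-type(2) ~ *)"
)

# 'h2 + p' additionally drops the paragraph directly after an h2.
no_intro = texts(
    "h2:nth-of-type(1) ~ *:not(h2 + p, h2:nth-of-type(2), h2:nth-of-type(2) ~ *)"
)

print(all_after)   # ['intro', 'Foo', 'x', 'Bar', 'Bar 1', 'B', 'Foo', 'y']
print(first_only)  # ['intro', 'Foo', 'x', 'Bar', 'Bar 1']
print(no_intro)    # ['Foo', 'x', 'Bar', 'Bar 1']
```

The Bar and Baz filtering is left out of the sketch on purpose, because the pseudo-class for text matching differs between soupsieve versions.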
In fact, the elif is not really needed and can be replaced with:
        else:
            temp[flag] += ' ' + j.text
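As an alternative to the selector-based loop, the same grouping can be sketched as a single pass over the siblings of the first h2. This is not the approach from the answer above, only one possible variant: the function name sections_to_dict and the SKIP set are hypothetical, it assumes all h1/h2/h3/p tags are siblings as in the question, and it uses html.parser rather than lxml. It also calls json.dumps at the end, since printing a dict shows Python's repr rather than JSON (and the question's variable named json would shadow that module).

```python
import json
from bs4 import BeautifulSoup

SKIP = {"Bar", "Baz"}  # headings the question wants left out (assumption)


def sections_to_dict(html):
    """Group <p> text under <h3> keys, one dict per <h2> section."""
    soup = BeautifulSoup(html, "html.parser")
    result = {"Category": soup.find("h1").get_text(), "Data": []}
    first_h2 = soup.find("h2")
    if first_h2 is None:
        return result

    current = None  # dict for the section currently being filled
    key = None      # active h3 heading; None means "ignore paragraphs"
    # Walk the first h2 and every tag after it once, in document order.
    for tag in [first_h2, *first_h2.find_next_siblings()]:
        if tag.name == "h2":
            current = {}
            result["Data"].append(current)
            key = None  # a <p> directly after the h2 is ignored
        elif tag.name == "h3":
            heading = tag.get_text()
            key = None if heading in SKIP else heading
        elif tag.name == "p" and key is not None:
            # Several paragraphs under one heading are joined with newlines.
            parts = [current.get(key), tag.get_text()]
            current[key] = "\n".join(part for part in parts if part)
    return result


sample = ("<h1>Heading</h1><p>test</p><h2>Section</h2><p>a.b.c</p>"
          "<h3>Prio</h3><p>Medium</p><h3>Description</h3><p>D1</p><p>D2</p>"
          "<h3>Bar</h3><p>Bar 1</p><h2>Section</h2><h3>Prio</h3><p>High</p>")
parsed = sections_to_dict(sample)
print(json.dumps(parsed, indent=2))  # real JSON text, double-quoted keys
```

Because each h2 starts a fresh dictionary and each h3 only switches the active key, no lookback with find_previous_sibling is needed, which is where the question's version ends up repeating sections.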