Converting an HTML (unordered) list to a nested Python dictionary
Tags: python, web-scraping, beautifulsoup, html-parsing
If I have a nested HTML (unordered) list like the following:
<ul style="">
<li class="jstree-last jstree-open" id="wfo-7000000004">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-7000000004">
<ins class="jstree-icon"> </ins>
Acoraceae
</a>
<ul style="">
<li class="jstree-last jstree-open" id="wfo-4000000350">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-4000000350">
<ins class="jstree-icon"> </ins>
Acorus
</a>
<ul style="">
<li class="jstree-open" id="wfo-0000350733">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000350733">
<ins class="jstree-icon"> </ins>
Acorus calamus
</a>
<ul style="">
<li class="jstree-leaf" id="wfo-0000350841">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000350841">
<ins class="jstree-icon"> </ins>
Acorus calamus var. americanus
</a>
</li>
<li class="jstree-last jstree-leaf" id="wfo-0000350949">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000350949">
<ins class="jstree-icon"> </ins>
Acorus calamus var. angustatus
</a>
</li>
</ul>
</li>
<li class="jstree-last jstree-leaf" id="wfo-0000352676">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000352676">
<ins class="jstree-icon"> </ins>
Acorus gramineus
</a>
</li>
</ul>
</li>
</ul>
</li>
</ul>
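I assume libraries such as Beautiful Soup and HTML Parser have tools for this (together with for loops in Python), but I haven't been able to figure it out. Thanks for any help.
I tried this: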
def create_dic(soup):
    return {li.a.get_text().replace("\xa0", ""): create_dic(li)
            for ul in soup('ul', recursive=False)
            for li in ul('li', recursive=False)}
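However, in the output, Acorus calamus var. americanus and Acorus calamus var. angustatus are not in a list as they should be, and Acorus gramineus is a dictionary when it should not be.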
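One likely reason can be sketched on a tiny sample: the dict comprehension in the attempt always returns a dict, so an `<li>` without a nested `<ul>` maps to an empty dict instead of a plain value. The `sample` string below is an illustrative assumption, not the markup from the question, and `html.parser` is chosen so that the top-level `<ul>` stays a direct child of the soup.

```python
from bs4 import BeautifulSoup

def create_dic(soup):
    # Same logic as the attempt above: the comprehension always builds a dict.
    return {li.a.get_text().replace("\xa0", ""): create_dic(li)
            for ul in soup("ul", recursive=False)
            for li in ul("li", recursive=False)}

# Tiny illustrative sample (an assumption, not the markup from the question).
sample = ('<ul><li><a href="#">Acorus</a>'
          '<ul><li><a href="#">Acorus gramineus</a></li></ul>'
          '</li></ul>')

# html.parser adds no <html>/<body> wrapper, so the <ul> is a direct child of the soup.
result = create_dic(BeautifulSoup(sample, "html.parser"))
print(result)  # {'Acorus': {'Acorus gramineus': {}}}
```

The leaf `Acorus gramineus` ends up as `{}` because the recursive call finds no `<ul>` under it and the comprehension yields an empty dict.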
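I'm going to answer this because, to get the existing answer to work, you have to call BeautifulSoup to parse the ul elements in your HTML. I have also flagged the question as a duplicate, so please close or delete this if it is one.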
from bs4 import BeautifulSoup
htmlbody = '''
<ul style="">
<li class="jstree-last jstree-open" id="wfo-7000000004">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-7000000004">
<ins class="jstree-icon"> </ins>
Acoraceae
</a>
<ul style="">
<li class="jstree-last jstree-open" id="wfo-4000000350">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-4000000350">
<ins class="jstree-icon"> </ins>
Acorus
</a>
<ul style="">
<li class="jstree-open" id="wfo-0000350733">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000350733">
<ins class="jstree-icon"> </ins>
Acorus calamus
</a>
<ul style="">
<li class="jstree-leaf" id="wfo-0000350841">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000350841">
<ins class="jstree-icon"> </ins>
Acorus calamus var. americanus
</a>
</li>
<li class="jstree-last jstree-leaf" id="wfo-0000350949">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000350949">
<ins class="jstree-icon"> </ins>
Acorus calamus var. angustatus
</a>
</li>
</ul>
</li>
<li class="jstree-last jstree-leaf" id="wfo-0000352676">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000352676">
<ins class="jstree-icon"> </ins>
Acorus gramineus
</a>
</li>
</ul>
</li>
</ul>
</li>
</ul>
'''
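def ul_to_dict(ul):
    """Recursively map each <li> label to a dict of its children (None for leaves)."""
    result = {}
    for li in ul.find_all("li", recursive=False):
        # The first non-empty string inside the <li> is the label in its <a>.
        key = next(li.stripped_strings)
        # Use a separate name so the ul parameter is not shadowed.
        child_ul = li.find("ul")
        if child_ul:
            result[key] = ul_to_dict(child_ul)
        else:
            result[key] = None
    return result

# Let BeautifulSoup do its magic and parse the ul from the HTML.
# Naming the parser explicitly avoids the "no parser was explicitly specified" warning.
soup_ul = BeautifulSoup(htmlbody, "html.parser").ul
# Run our function
print(ul_to_dict(soup_ul))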
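The question suggests that sibling leaves should end up in a list rather than as dictionary keys. A minimal sketch of such a variant follows, assuming the rule "a level made only of leaves collapses to a list, while any other level stays a dict with None for its leaves". The name `ul_to_tree`, that rule, and the shortened `sample` markup are all assumptions for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

def ul_to_tree(ul):
    """Convert a <ul> Tag to a dict, or to a list when every child <li> is a leaf."""
    items = []
    for li in ul.find_all("li", recursive=False):
        label = next(li.stripped_strings)
        child_ul = li.find("ul")
        items.append((label, ul_to_tree(child_ul) if child_ul else None))
    # Assumed rule: a level made only of leaves collapses to a plain list.
    if all(subtree is None for _, subtree in items):
        return [label for label, _ in items]
    return dict(items)

# Shortened, illustrative version of the markup in the question.
sample = """
<ul>
  <li><a href="#">Acorus</a>
    <ul>
      <li><a href="#">Acorus calamus</a>
        <ul>
          <li><a href="#">Acorus calamus var. americanus</a></li>
          <li><a href="#">Acorus calamus var. angustatus</a></li>
        </ul>
      </li>
      <li><a href="#">Acorus gramineus</a></li>
    </ul>
  </li>
</ul>
"""

tree = ul_to_tree(BeautifulSoup(sample, "html.parser").ul)
print(tree)
```

With this rule the two varieties under Acorus calamus come back as a list, while Acorus gramineus, which sits next to a non-leaf sibling, stays a key whose value is None. If a different shape is wanted for mixed levels, only the final branch of the function needs to change.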