Python: how do I convert a string into multi-level JSON?

python html json

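My HTML file is structured as follows:

Section 1.1
1.1.1 random paragraph
1.1.1.1 random paragraph

Section 1.2
1.2.1 random paragraph
1.2.1.1 random paragraph

Section 11.4…
11.4.12 random paragraph
11.2.12.1 random paragraph

HTML example: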

<p>
  <span class="c1"
    >Section 1.1.<span class="c7">&nbsp;&nbsp;</span>Organization and
    Application</span
  >
</p>
<p>
  <span class="c1"
    >1.1.1.<span class="c7">&nbsp;&nbsp;</span>Organization of this Code</span
  >
</p>
<p align="justify">
  <span class="c1">1.1.1.1.&nbsp;&nbsp;Scope of Division A</span>
</p>
<p align="justify">
  <span
    ><b>(1)&nbsp;&nbsp;</b>Division A contains compliance and application
    provisions and the <i>objectives</i> and <i>functional statements</i> of
    this Code.</span
  >
</p>
<p align="justify">
  <span class="c1">1.1.1.2.&nbsp;&nbsp;Scope of Division B</span>
</p>
<p align="justify">
  <span
    ><b>(1)&nbsp;&nbsp;</b>Division B contains the
    <i>acceptable solutions</i> of this Code.</span
  >
</p>
<p align="justify">
  <span class="c1">1.1.1.3.&nbsp;&nbsp;Scope of Division C</span>
</p>

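I can build a first-level list of "key-value" pairs from the sections and the HTML they contain.

stringToList(string, devider) takes a string and a regular expression, and returns a list of the form [[name, resultHtml], [name, resultHtml]]: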

import re

def stringToList(string, devider):
    matches = re.finditer(devider, string)

    matchArr = []
    lastMatch = None  # no previous match yet on the first iteration
    for m in matches:
        if lastMatch is not None:
            start = lastMatch.start()
            end = m.start()
            resultHtml = string[start:end]  # html string starting with last match, ending with current match
            name = lastMatch.group().replace('class="c1">', '').replace('<', '')  # match group from last match minus the regex tags
            matchArr.append([name, resultHtml])

        lastMatch = m
    # note: the section that follows the final match is never appended
    return matchArr  # returns list[[name, resultHtml], [name, resultHtml]]
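
The same idea (slice the page from one regex match to the next) can be sketched with the standard library alone. This is only an illustration: the function name split_sections, the sample string and the pattern are assumptions made for the example, not part of the original post. Unlike the function above, this version also keeps the text after the final match.

```python
import re

def split_sections(page, divider):
    """Split `page` at every regex match of `divider`.

    Returns [[name, html], ...] where `html` runs from one match up to
    the next match, or to the end of the string for the final match.
    `divider` is expected to contain one capture group holding the name.
    """
    matches = list(re.finditer(divider, page))
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(page)
        sections.append([m.group(1), page[m.start():end]])
    return sections

# Hypothetical sample, shaped like the headings in the question's HTML
sample = (
    '<p><span class="c1">1.1.1.1.&nbsp;&nbsp;Scope of Division A</span></p>'
    '<p><span>Division A text</span></p>'
    '<p><span class="c1">1.1.1.2.&nbsp;&nbsp;Scope of Division B</span></p>'
    '<p><span>Division B text</span></p>'
)

# Hypothetical pattern: capture the dotted number right after class="c1">
pattern = r'class="c1">((?:\d+\.)+)'

pairs = split_sections(sample, pattern)
for name, html in pairs:
    print(name, len(html))
```

Using a capture group for the name avoids the chain of replace() calls, at the cost of requiring the caller to supply a pattern with exactly that group.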

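The ultimate goal is to have a nested list of links to each piece of HTML.

Is this the best way to reach that goal? Any input or suggestions are welcome.

Your HTML is not well formed, but it is still possible to get the result you are after. It just takes a bit more work: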

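html = """
<p>
  <span class="c1"
    >Section 1.1.<span class="c7">&nbsp;&nbsp;</span>Organization and
    Application</span
  >
</p>
<p>
  <span class="c1"
    >1.1.1.<span class="c7">&nbsp;&nbsp;</span>Organization of this Code</span
  >
</p>
<p align="justify">
  <span class="c1">1.1.1.1.&nbsp;&nbsp;Scope of Division A</span>
</p>
<p align="justify">
  <span
    ><b>(1)&nbsp;&nbsp;</b>Division A contains compliance and application
    provisions and the <i>objectives</i> and <i>functional statements</i> of
    this Code.</span
  >
</p>
<p align="justify">
  <span class="c1">1.1.1.2.&nbsp;&nbsp;Scope of Division B</span>
</p>
<p align="justify">
  <span
    ><b>(1)&nbsp;&nbsp;</b>Division B contains the
    <i>acceptable solutions</i> of this Code.</span
  >
</p>
<p align="justify">
  <span class="c1">1.1.1.3.&nbsp;&nbsp;Scope of Division C</span>
</p>
"""

import re
import json
from bs4 import BeautifulSoup

# Parse the HTML so we can search it
soup = BeautifulSoup(html)

# Create a placeholder for the output
output = {}
# Keep track of how deep we are nested
last_section_number = "1"


def clean_text(text: str) -> str:
    """Helper method to clean up the text.

    Args:
        text (str): input text

    Returns:
        str: cleaned output
    """
    text = text.replace("\u00a0", " ")
    text = " ".join([line.strip() for line in text.split("\n")])
    return text


# Every part of the document seems to start with a <p> tag
for part in soup.find_all("p"):
    # If the part is a section title (or subsection title), it has a span with class="c1"
    section_title = part.find("span", "c1")
    if section_title is not None:
        # Use a regular expression to extract the section number
        section_number = re.search(r"(\d+\.)+", section_title.text).group(0).strip(".")

        def set_nested(section_number: str, subsection: dict) -> dict:
            """Method to traverse down to a subsection based on a
            section number formatted as a string.

            Args:
                section_number (str): the section number to traverse to (e.g. "1.1")
                subsection (dict): the subsections of the current section

            Returns:
                the updated section
            """
            # Split the section number, keeping the first part as the main part
            main, *rest = section_number.split(".")
            # If there is no "rest", there is no deeper level
            if len(rest)  # (the answer's code breaks off here)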
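
Because the answer's code is cut off in the middle of set_nested, the following is only a guess at how such a helper could work, not the answerer's actual implementation. It walks down a dict one dotted component at a time and stores the heading text at the deepest level. The extra value parameter and the "_text" key are assumptions introduced for this sketch.

```python
import json

def set_nested(section_number, value, tree):
    """Store `value` under the path given by a dotted section number.

    "1.1.2" ends up at tree["1"]["1"]["2"]["_text"]. The "_text" key is
    an assumption of this sketch so that a section can hold both its own
    text and its subsections.
    """
    # Split the section number, keeping the first part as the main part
    main, *rest = section_number.split(".")
    node = tree.setdefault(main, {})
    if not rest:
        # No "rest" means there is no deeper level: store the value here
        node["_text"] = value
    else:
        set_nested(".".join(rest), value, node)
    return tree

output = {}
for number, text in [
    ("1.1", "Organization and Application"),
    ("1.1.1", "Organization of this Code"),
    ("1.1.1.1", "Scope of Division A"),
    ("1.1.1.2", "Scope of Division B"),
]:
    set_nested(number, text, output)

print(json.dumps(output, indent=2))
```

In the loop over the parsed paragraphs, the section number extracted by the regular expression and the cleaned heading text would take the place of the hard-coded pairs used here.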
Thanks, I'll give it a try. Can I use the lxml BeautifulSoup parser to achieve the same result?

Sure, why not? You didn't provide an HTML sample, so it's a bit hard to tell, but it should work fine.

I've added an HTML snippet.

I've updated the answer; it now works on the snippet you provided.
The nested structure being aimed for looks roughly like this:

main: {
    1: {
        1.1: {
            1.1.1: html,
            1.1.2: html,
        }
    }
}
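
One possible way to get from a flat list of [name, html] pairs to that shape is to key every level by its full prefix ("1", then "1.1", then "1.1.1"). The sketch below is an illustration under assumptions: nest_by_prefix and the sample pairs are made up for the example, and it assumes that only the deepest numbers carry HTML (a number that is both a leaf and a parent of another entry would need extra handling).

```python
import json

def nest_by_prefix(pairs):
    """Build {"1": {"1.1": {"1.1.1": html}}} from [(number, html), ...]."""
    tree = {}
    for number, html in pairs:
        parts = number.strip(".").split(".")
        node = tree
        # Walk through every ancestor prefix: "1", "1.1", ...
        for depth in range(1, len(parts)):
            node = node.setdefault(".".join(parts[:depth]), {})
        # Store the HTML under the full section number
        node[".".join(parts)] = html
    return tree

# Hypothetical flat input, e.g. the output of a splitting step
pairs = [
    ("1.1.1", "<p>a</p>"),
    ("1.1.2", "<p>b</p>"),
    ("1.2.1", "<p>c</p>"),
]

result = {"main": nest_by_prefix(pairs)}
print(json.dumps(result, indent=4))
```

json.dumps turns the nested dict directly into multi-level JSON; keys stay strings such as "1.1", which is what JSON requires anyway.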