Parsing an HTML table into a Python dictionary with BeautifulSoup

Here is the HTML I am trying to parse with BeautifulSoup:

<table>
          <tr>
            <th width="100">menu1</th>
            <td>
              <ul class="classno1" style="margin-bottom:10;">
                    <li>Some data1</li>
                    <li>Foo1<a href="/link/to/bar1">Bar1</a></li>
                    ... (the number of these tags isn't fixed)
              </ul>
            </td>
          </tr>
          <tr>
            <th width="100">menu2</th>
            <td>
              <ul class="classno1" style="margin-bottom:10;">
                    <li>Some data2</li>
                    <li>Foo2<a href="/link/to/bar2">Bar2</a></li>
                    <li>Foo3<a href="/link/to/bar3">Bar3</a></li>
                    <li>Some data3</li>
                    ... (the number of these tags isn't fixed either)
              </ul>
            </td>
          </tr>
</table>
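As noted in the code, the number of tags isn't fixed. In addition, the table may contain:
  • both menu1 and menu2
  • only menu1
  • only menu2
  • neither menu1 nor menu2 (only …)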

The result I want is a dictionary like this:

    DICT = {
        'menu1': ['Some data1','Foo1 Bar1'],
        'menu2': ['Some data2','Foo2 Bar2','Foo3 Bar3','Some data3'],
    }

For example, the table might contain only menu1:

    <table>
              <tr>
                <th width="100">menu1</th>
                <td>
                  <ul class="classno1" style="margin-bottom:10;">
                        <li>Some data1</li>
                        <li>Foo1<a href="/link/to/bar1">Bar1</a></li>
                        ... (the number of these tags isn't fixed)
                  </ul>
                </td>
              </tr>
    </table>
    
This is always the first table on the page, so I can get it with:

    table = soup.find_all('table')[0]
    
Thanks in advance for any help.
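
A side note on selecting the table: `soup.find_all('table')[0]` raises `IndexError` when the page has no table at all, while `soup.find('table')` returns `None` in that case. The sketch below assumes the `bs4` package with the built-in `html.parser`; the helper name `first_table` is made up for illustration.

```python
from bs4 import BeautifulSoup


def first_table(html):
    """Return the first <table> tag found in html, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns None instead of raising when nothing matches.
    return soup.find("table")


print(first_table("<p>no table here</p>"))           # None
print(first_table("<table><tr></tr></table>").name)  # table
```

Checking for `None` up front keeps the later row handling from failing on pages that lack the table.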

    html = """<table>
              <tr>
                <th width="100">menu1</th>
                <td>
                  <ul class="classno1" style="margin-bottom:10;">
                        <li>Some data1</li>
                        <li>Foo1<a href="/link/to/bar1">Bar1</a></li>
                  </ul>
                </td>
              </tr>
              <tr>
                <th width="100">menu2</th>
                <td>
                  <ul class="classno1" style="margin-bottom:10;">
                        <li>Some data2</li>
                        <li>Foo2<a href="/link/to/bar2">Bar2</a></li>
                        <li>Foo3<a href="/link/to/bar3">Bar3</a></li>
                        <li>Some data3</li>
                  </ul>
                </td>
              </tr>
    </table>"""
    
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')

    # The table of interest is always the first one on the page.
    table = soup.find_all('table')[0]

    results = {}

    # Each <th> holds a menu name; the <td> that follows it holds the <ul>.
    for th in table.find_all('th'):
        td = th.find_next_sibling('td')
        # next_element is the first node inside the tag, here its leading text,
        # so "Foo1<a ...>Bar1</a>" yields only "Foo1".
        items = [str(li.next_element) for li in td.find_all('li')]
        results[str(th.next_element)] = items

    print(results)

This prints (reformatted for readability):

    {
        'menu1': ['Some data1', 'Foo1'],
        'menu2': ['Some data2', 'Foo2', 'Foo3', 'Some data3']
    }
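
The output above keeps only the first text node of each `<li>`, so `Foo1` comes out without `Bar1`. To get the `'Foo1 Bar1'` form shown in the question, one option is `get_text()` with a separator. This is a minimal sketch, assuming `bs4` with `html.parser`; `table_to_dict` is a hypothetical helper name and the sample HTML is shortened from the question.

```python
from bs4 import BeautifulSoup


def table_to_dict(html):
    """Map each <th> label in the first table to the texts of its <li> items."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:  # page without any table
        return {}
    result = {}
    for row in table.find_all("tr"):
        th = row.find("th")
        td = row.find("td")
        if th is None or td is None:  # skip rows without the th/td layout
            continue
        # get_text(" ", strip=True) joins nested strings with a space,
        # so "Foo1" followed by <a>Bar1</a> becomes "Foo1 Bar1".
        result[th.get_text(strip=True)] = [
            li.get_text(" ", strip=True) for li in td.find_all("li")
        ]
    return result


sample = """<table>
  <tr>
    <th width="100">menu1</th>
    <td>
      <ul class="classno1" style="margin-bottom:10;">
        <li>Some data1</li>
        <li>Foo1<a href="/link/to/bar1">Bar1</a></li>
      </ul>
    </td>
  </tr>
</table>"""

print(table_to_dict(sample))
# {'menu1': ['Some data1', 'Foo1 Bar1']}
```

Because rows are handled one at a time, the same function covers a table with both menus, only one of them, or none.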