Parsing an HTML table into a Python dictionary with BeautifulSoup
This is the HTML code I am trying to parse with BeautifulSoup:
<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1<a href="/link/to/bar1">Bar1</a></li>
... (the number of these tags isn't fixed)
</ul>
</td>
</tr>
<tr>
<th width="100">menu2</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data2</li>
<li>Foo2<a href="/link/to/bar2">Bar2</a></li>
<li>Foo3<a href="/link/to/bar3">Bar3</a></li>
<li>Some data3</li>
... (the number of these tags isn't fixed either)
</ul>
</td>
</tr>
</table>
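As I already mentioned in the code, the number of <li> tags isn't fixed. The rows themselves can vary too: for example, the table might contain only menu1, as in the second sample below.
I need to parse it into a dictionary like this: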
DICT = {
'menu1': ['Some data1','Foo1 Bar1'],
'menu2': ['Some data2','Foo2 Bar2','Foo3 Bar3','Some data3'],
}
<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1<a href="/link/to/bar1">Bar1</a></li>
... (the number of these tags isn't fixed)
</ul>
</td>
</tr>
</table>
This is always the first table on the page, so I can use:
table = soup.find_all('table')[0]
Thanks in advance for your help.
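On picking the first table: a minimal sketch, assuming BeautifulSoup 4 (the `bs4` package) on Python 3 rather than whatever version the question uses. The `page` string is made up for illustration. `find` returns the first match directly and gives `None` when nothing matches, whereas indexing `find_all(...)[0]` raises `IndexError` on a page without tables.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables, used only for this illustration.
page = (
    "<p>intro</p>"
    "<table id='first'><tr><td>a</td></tr></table>"
    "<table id='second'><tr><td>b</td></tr></table>"
)
soup = BeautifulSoup(page, "html.parser")

# Both expressions select the first <table> in document order.
first_by_index = soup.find_all("table")[0]
first_by_find = soup.find("table")

# On a page without any table, find() returns None instead of raising.
empty = BeautifulSoup("<p>no tables here</p>", "html.parser")
missing = empty.find("table")

print(first_by_find["id"], missing)
```

Checking the result against `None` before walking the rows avoids a crash when the table is absent.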
html = """<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1<a href="/link/to/bar1">Bar1</a></li>
</ul>
</td>
</tr>
<tr>
<th width="100">menu2</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data2</li>
<li>Foo2<a href="/link/to/bar2">Bar2</a></li>
<li>Foo3<a href="/link/to/bar3">Bar3</a></li>
<li>Some data3</li>
</ul>
</td>
</tr>
</table>"""
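# Legacy BeautifulSoup 3 import (Python 2 only)
import BeautifulSoup as bs

soup = bs.BeautifulSoup(html)
table = soup.findAll('table')[0]
results = {}
th = table.findChildren('th')  # ,text=['menu1','menu2'])
for x in th:
    # print x
    results_li = []
    # Skip the whitespace text node to reach the <td> that follows the <th>
    li = x.nextSibling.nextSibling.findChildren('li')
    for y in li:
        # print y.next
        results_li.append(y.next)  # first text node inside the <li>
    results[x.next] = results_li  # key is the header text
print results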
{
u'menu2': [u'Some data2', u'Foo2', u'Foo3', u'Some data3'],
u'menu1': [u'Some data1', u'Foo1']
}
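The output above keeps only the first text node of each <li> (`Foo1` rather than `Foo1 Bar1`), because `.next` returns just the next node in the parse tree. A possible variant that also picks up the link text is sketched below. It assumes BeautifulSoup 4 (the `bs4` package) on Python 3 instead of the BeautifulSoup 3 code above, and `table_to_dict` is a hypothetical helper name, not part of the library.

```python
from bs4 import BeautifulSoup

html = """<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1<a href="/link/to/bar1">Bar1</a></li>
</ul>
</td>
</tr>
<tr>
<th width="100">menu2</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data2</li>
<li>Foo2<a href="/link/to/bar2">Bar2</a></li>
<li>Foo3<a href="/link/to/bar3">Bar3</a></li>
<li>Some data3</li>
</ul>
</td>
</tr>
</table>"""


def table_to_dict(markup):
    """Map each <th> text in the first table to the texts of the <li> items in its row."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table")
    result = {}
    if table is None:
        # No table on the page: nothing to collect.
        return result
    for row in table.find_all("tr"):
        header = row.find("th")
        cell = row.find("td")
        if header is None or cell is None:
            # Skip rows that do not follow the th/td layout.
            continue
        # get_text(" ", strip=True) joins every text fragment inside the <li>
        # with a single space, so "Foo1" and the link text "Bar1" are combined.
        items = [li.get_text(" ", strip=True) for li in cell.find_all("li")]
        result[header.get_text(strip=True)] = items
    return result


print(table_to_dict(html))
```

Iterating over the rows instead of relying on `nextSibling.nextSibling` means the result does not depend on the whitespace between <th> and <td>, and a table with only one of the menu rows simply yields a dictionary with one key.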