Python: navigating an HTML table with lxml

I have some HTML that looks like this:
<html>
<body>
<table cellpadding="0" cellspacing="0" border="0" width="100%">
<tr>
<td align="left" colspan="4">
<!-- BEGIN NEXT PREV LINKS -->
<table cellspacing="2" cellpadding="0" border="0">
<tr>
<td align="left"><font style="color:gray">Previous</font> </td>
<td align="center" colspan="2" nowrap><b>1-100 of 273 employees</b></td>
<td align="right"> <a href="">Next</a></td>
</tr>
<tr>
<td align="left" colspan="2"><font style="color:gray">First Page</font></td>
<td align="right" colspan="2"> <a href="">Last Page</a></td>
</tr>
</table>
<!-- END NEXT PREV LINKS -->
</td>
<td colspan="9" align="right">
<a href="">Add Checked to Favorites</a>
<br>
<a href="">Add Checked to Excluded</a>
</td>
</tr>
<tr>
<td rowspan="2"></td><td rowspan="2"></td> <td rowspan="2" valign="bottom" style="padding-right:5px;"><b><a href=""/></td>
<td rowspan="2" valign="bottom" style="padding-right:5px;"><b><a href="">Position</a></b></td>
<td colspan="2" align="center" valign="bottom" height="16"><b>Ratings</b><br><img src="/images/shim_333333.gif" width="130" height="1" alt="" hspace="5"></td> <td rowspan="2"> </td> <td rowspan="2" valign="bottom" style="padding-right:5px;"><b><a href="">Birth Date</a></b></td>
<td rowspan="2" valign="bottom" style="padding-right:5px;"><b><a href="">States</a></b></td>
<td rowspan="2"> </td><td rowspan="2"></td> <td rowspan="2" colspan="3" align="right" valign="bottom"><a href="">Clear All</a> </td> </tr>
<tr>
<td align="center"><b><a href="">In-State<br>Rating</a></b></td>
<td align="center"><b><a href="">Out of State<br>Rating</a></b></td>
</tr>
<tr>
<td colspan="13" valign="bottom"><img src="/images/shim.gif" width="100%" height="1" alt=""></td>
</tr> <tr>
<td align="right" colspan=13><img src="/images/shim_dddddd.gif" width="100%" height="1" border="0" alt=""></td>
</tr> <tr >
<td></td><td><b style="">X</b></td>
<td nowrap><p><a href="">Cruise, Tom</a> </p></td>
<td nowrap>Actor </td>
<td align="center"><img src="/images/stars_2_sm_green.gif" alt="instate Recommendation Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td>
<td align="center"><img src="/images/stars_4_sm.gif" alt="Summary Estimate Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td>
<td> </td>
<td nowrap>1948 </td>
<td nowrap>CA</td>
<td></td><td></td>
<td> </td>
<td align="right"><input type="checkbox" name="employee_cb" value="198720" style="height:15px"></td>
</tr> <tr>
<td align="right" colspan=13><img src="/images/shim_dddddd.gif" width="100%" height="1" border="0" alt=""></td>
</tr> <tr >
<td><b style="">X</b></td><td></td>
<td nowrap><p><a href="">Schwarzenegger, Arnold</a> </p></td>
<td nowrap>Governor </td>
<td align="center"><img src="/images/ohuohausd.jpg" alt="instate Recommendation Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td>
<td align="center"><img src="/images/ohuohausd.jpg" alt="Summary Estimate Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td>
<td> </td>
<td nowrap>No Current Date </td>
<td nowrap>-</td>
<td></td><td></td>
<td> </td>
<td align="right"><input type="checkbox" name="employee_cb" value="61184" style="height:15px"></td>
</tr> <tr >
<td><b style="">X</b></td><td></td>
<td nowrap><p><a href="">Obama, Barack</a> </p></td>
<td nowrap>President </td>
<td align="center"><img src="/images/ohuohausd.jpg" alt="instate Recommendation Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td>
<td align="center"><img src="/images/ohuohausd.jpg" alt="Summary Estimate Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td>
<td> </td>
<td nowrap>No Current Date </td>
<td nowrap>-</td>
<td></td><td></td>
<td> </td>
<td align="right"><input type="checkbox" name="employee_cb" value="225747" style="height:15px"></td>
</tr>
<tr height="15">
<td align="right" colspan="14">
<!-- BEGIN NEXT PREV LINKS -->
<table cellspacing="2" cellpadding="0" border="0">
<tr>
<td align="left"><font style="color:gray">Previous</font> </td>
<td align="center" colspan="2" nowrap><b>1-100 of 273 employees</b></td>
<td align="right"> <a href="">Next</a></td>
</tr>
<tr>
<td align="left" colspan="2"><font style="color:gray">First Page</font></td>
<td align="right" colspan="2"> <a href="">Last Page</a></td>
</tr>
</table>
<!-- END NEXT PREV LINKS -->
</td>
</tr> <tr>
<td colspan="12" valign="bottom" nowrap><br>
<b style="">X</bfdgdfgb style="">X</b>Lorem ipsum dolor sit amet, consectetur adipiscing elit<br>
<b style="c">X</b>dfgfdg<b style="">X</b>Lorem ipsum dolor sit amet, consectetur adipiscing elit<br> <b style="">F</b>: A dsd "<b style="">F</b>Lorem ipsum dolor sit amet, consectetur adipiscing elit<br>
dfgdfg"<b style="">F</b>"Lorem ipsum dolor sit amet, consectetur adipiscing elit<br>
<b style="">E</b>gfhbgdfg"<b style="">E</b>Lorem ipsum dolor sit amet, consectetur adipiscing elit
</td>
</tr><tr><td colspan="20">
<table cellpadding="0" cellspacing="0" border="0" width="100%" align="center">
<tr>
<td colspan="2"><img src="/images/shim.gif" width="100%" height="5" alt=""></td>
</tr>
<tr>
<td valign="top">States: </td>
<td>CA=California; ND=North Dakota</td>
</tr>
</table>
</td></tr>
</table></body>
</html>
Parsing this produces a very messy list. My end goal is to end up with:
[['Cruise, Tom', 'Actor', '1948', 'CA'], ['Schwarzenegger, Arnold', 'Governor', 'No Current Date', '-'], ...]
However, all the extra information contained in the table produces a lot of strange elements. I know I can clean up the results by replacing
\xa0
with a single space. I just don't know how to navigate any further. Thanks.

You have to traverse the html document with more precise XPaths. You also face the challenge that the related data sits in different elements, which requires two XPath expressions. It takes some manipulation to relate the final results:
import lxml.etree as et

# Strip &nbsp; entities and stray semicolons before parsing
with open("employeetest.htm", 'r') as f:
    text = f.read().replace('&nbsp;', '').replace(';', '')

root = et.HTML(text)

# XPATH LISTS (W/ RELATED ITEMS)
items1 = root.xpath("//td/p/a/text()")
items2 = root.xpath("//td[p/a/text()]/following-sibling::td/text()")

# NUMBER OF ITEMS RELATED TO EACH NAME
r = int(len(items2) / len(items1))

# ITERATE THROUGH WITH LIST SLICE AND APPEND
data = []
for i in range(len(items1)):
    inner = [items1[i]]
    for j in items2[i*r : i*r + 2]:   # TAKE THE TWO FIELDS FOLLOWING EACH NAME
        inner.append(j)
    data.append(inner)

print(data)
# [['Cruise, Tom', 'Actor', '1948'],
#  ['Schwarzenegger, Arnold', 'Governor', 'No Current Date'],
#  ['Obama, Barack', 'President', 'No Current Date']]
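The pairing logic above can be sketched on toy lists, independent of any HTML: one flat list of names and one flat list of related fields, sliced in fixed-size chunks. The literals below are illustrative, not taken from the real scrape:

```python
# Toy illustration of pairing a list of names with a flat list of
# related fields, taking a fixed number of fields per name.
names = ['Cruise, Tom', 'Schwarzenegger, Arnold', 'Obama, Barack']
fields = ['Actor', '1948', 'CA',
          'Governor', 'No Current Date', '-',
          'President', 'No Current Date', '-']

r = len(fields) // len(names)   # related fields per name (3 here)

data = []
for i, name in enumerate(names):
    # Slice out this name's chunk of the flat field list
    data.append([name] + fields[i*r:(i+1)*r])

print(data)
# [['Cruise, Tom', 'Actor', '1948', 'CA'],
#  ['Schwarzenegger, Arnold', 'Governor', 'No Current Date', '-'],
#  ['Obama, Barack', 'President', 'No Current Date', '-']]
```

Taking the whole chunk `fields[i*r:(i+1)*r]` rather than a hard-coded two items keeps the pairing correct if the number of extracted fields per row changes.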
Not sure what the
…
in your expected output is supposed to be, but to get the data in the first three sublists, you can narrow the search down to the TRs that contain a TD with a nowrap attribute and only one attribute in total:
from lxml import html

# h holds the HTML source as a string, e.g. h = open("employeetest.htm").read()
root = html.fromstring(h)

# Employee rows are the only TRs whose TDs carry a lone nowrap attribute
rows = root.xpath("//tr[td[@nowrap and text() and count(@*)=1]]")
data = []
for row in rows:
    data.append(row.xpath(".//td[@nowrap]//text()"))
    print(data[-1])
Output:
['Cruise, Tom', u'\xa0', u'Actor\xa0', u'1948\xa0', 'CA']
['Schwarzenegger, Arnold', u'\xa0', u'Governor\xa0', u'No Current Date\xa0', '-']
['Obama, Barack', u'\xa0', u'President\xa0', u'No Current Date\xa0', '-']
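As the question notes, the leftover \xa0 (non-breaking space) characters can then be stripped in a post-processing pass. A minimal sketch, cleaning one extracted row of the shape shown above:

```python
# Replace non-breaking spaces (\xa0), strip whitespace, and drop cells
# that end up empty, yielding the question's target row shape.
row = ['Cruise, Tom', '\xa0', 'Actor\xa0', '1948\xa0', 'CA']

cleaned = [cell.replace('\xa0', ' ').strip() for cell in row]
cleaned = [cell for cell in cleaned if cell]   # discard now-empty cells

print(cleaned)  # ['Cruise, Tom', 'Actor', '1948', 'CA']
```

Applying the same list comprehension to every row from the XPath result gives the `[['Cruise, Tom', 'Actor', '1948', 'CA'], ...]` structure the question asks for.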
The data isn't in any table. What does the
..
in your expected output represent?
Maybe I'm mistaken, but isn't it wrapped in a table? Yes, I missed the opening tag. So all you want from what you posted are the three sublists in your question?
Thanks! This works. I only needed to change the encoding (for my purposes), in case anyone else stumbles on this and needs that information.