Python parsing: extracting data from an HTML test file with a non-standard layout


I need help parsing an HTML text file. I don't know how to parse its layout, and I could really use some help.

Code so far:

import urllib,os, urllib2, webbrowser, StringIO, re
from BeautifulSoup import BeautifulSoup
from urllib import urlopen

urlfile = open('output.txt','r')

html = urlfile

soup = BeautifulSoup(''.join(html))

print soup.prettify()
table = soup.find('table', id="dgProducts__ctl2_lblCountry")
rows = table.findAll('<span id="dgProducts__ctl2_lblCountry">')

for tr in rows:
  cols = tr.findAll('td')
for td in cols:
   text = ''.join(td.find(text=True))
   print text+"|",
print
Abbreviated structure of the data in the .html file:

</tr><tr bgcolor="White">
  <td><font color="#330099" size="1">
         <span><font size="2">
           <input id="dgProducts__ctl12_ckCompare" type="checkbox" name="dgProducts:_ctl12:ckCompare" onclick="checkSelected(this.form, this);" />
           </font></span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblModel1"><font size="2">  
          <a href='ProductDisplay.aspx?return=pm&action=view&search=true&productid=4592&ProductType=1&epeatcountryid=1'>Ace Vision 7HS</a></font></span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblCountry">United States</span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblProductCategory1"><font size="2">Desktops</font></span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblRating1"><font size="2">Gold</font></span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblPoints1">18</span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblEnergyStar">5.0</span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblMonitorType1"><font size="2"></font></span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblMonitorSize"><font size="2"></font></span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblListingDate1"><font size="2">3/16/2010</font></span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblStatus"><font size="2">Active</font></span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblExceptions" align="center"><a href='#' onclick=ShowExceptions('Exceptions.aspx?id=4592');>    
          <img src='http://www.epeat.net/Images/inform.gif' title='Click to view exceptions' alt='Click to view exceptions' border='0'></a></span>
        </font></td>

United States
Desktops
Gold
18
5
3/16/2010
Active

I suggest using the minidom module (xml.dom.minidom). It makes parsing XML files easy, and it also works on HTML as long as the markup is well-formed.
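The sample row contains unquoted attribute values and bare `&` characters in the link URL, which a strict XML parser will likely reject. A more lenient route is sketched below using the standard library's `html.parser` (Python 3) rather than minidom or BeautifulSoup. It assumes every value of interest sits in a span whose id looks like `dgProducts__ctlN_lblName`; the class name `LabelSpanParser`, the regular expression, and the trimmed sample row are illustrative, not part of the original question.

```python
import re
from html.parser import HTMLParser

# Assumed id pattern, inferred from the sample: dgProducts__ctl<row>_lbl<field>
SPAN_ID = re.compile(r"^dgProducts__ctl(\d+)_lbl(\w+)$")


class LabelSpanParser(HTMLParser):
    """Hypothetical helper: collects the text of each labelled <span>, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = {}        # row number -> {field name: text}
        self._current = None  # (row, field) of the labelled span being read
        self._depth = 0       # nesting depth of <span> tags inside that span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._current is not None:
            self._depth += 1  # a nested span inside a labelled span
            return
        match = SPAN_ID.match(dict(attrs).get("id") or "")
        if match:
            self._current = (int(match.group(1)), match.group(2))
            self._depth = 1
            row, field = self._current
            self.rows.setdefault(row, {})[field] = ""

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._depth -= 1
            if self._depth == 0:
                row, field = self._current
                # Trim the whitespace that surrounds the value in the markup.
                self.rows[row][field] = self.rows[row][field].strip()
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            row, field = self._current
            self.rows[row][field] += data


# Trimmed copy of the row from the question, for illustration only.
sample = """
<tr bgcolor="White">
  <td><span id="dgProducts__ctl12_lblModel1"><font size="2">
    <a href='ProductDisplay.aspx?productid=4592'>Ace Vision 7HS</a></font></span></td>
  <td><span id="dgProducts__ctl12_lblCountry">United States</span></td>
  <td><span id="dgProducts__ctl12_lblRating1"><font size="2">Gold</font></span></td>
  <td><span id="dgProducts__ctl12_lblMonitorSize"><font size="2"></font></span></td>
</tr>
"""

parser = LabelSpanParser()
parser.feed(sample)
parser.close()

for row, fields in sorted(parser.rows.items()):
    print(row, "|".join(fields.values()))
```

Empty cells such as MonitorSize come back as empty strings, so each row keeps the same number of fields. To run this against the real file, the `sample` string would be replaced with the contents of `output.txt`.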

Just noticed ''.join(html) instead of html.read(). Well, to each their own :)

"Code so far"... "What I'm trying to do"... and what isn't working? What is your problem? What do you need help with?
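On the ''.join(html) remark: iterating over a file object yields its lines with the newlines kept, so joining them produces the same string as a single read() call. A minimal Python 3 sketch follows; `io.StringIO` stands in for the opened `output.txt`, and its contents are made up for illustration.

```python
import io

# Stand-in for open('output.txt', 'r'); the contents are invented for this example.
urlfile = io.StringIO("<tr>\n<td>United States</td>\n</tr>\n")

joined = "".join(urlfile)  # iterates line by line, newlines included

urlfile.seek(0)            # rewind before reading the same stream again
whole = urlfile.read()     # reads everything in one call

print(joined == whole)
```

Both forms hand the parser the same text, so the choice between them is one of style rather than correctness.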