Detecting numbers in an HTML file using Python 3.2

I have an HTML file that I want to parse with Python 3.2. Sample:

<td class="ln">15</td><td class="sf3b2"><code>&nbsp;</code></td>
<td class="ln">15</td><td class="sf3b2"><code>&nbsp;</code></td>

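The task is to detect the bare numbers, without their tags (only 15 in this example), and store them in a separate text file. I can't decide which HTML parser to use (lxml, BeautifulSoup) because I'm not familiar with either yet. Could you guide me on how to approach this? Thanks in advance.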

BeautifulSoup makes this very easy. You can use the find_all method to find the elements and then process them:

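from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")  # html_doc: the HTML file's contents as a string
tds = soup.find_all("td", "ln")  # every <td> whose class is "ln"
for td in tds:
    pass # do something here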
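
Since the question is still open on which parser to pick, here is a rough sketch of the same idea using only the standard library's html.parser module instead of BeautifulSoup, so nothing has to be installed. The names LineNumberCollector and collect_ln_cells are made up for this illustration, and the sketch assumes the cells look exactly like the sample (class attribute equal to "ln", no nested cells):

```python
from html.parser import HTMLParser


class LineNumberCollector(HTMLParser):
    """Collect the text of every <td class="ln"> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_ln_cell = False  # True while we are inside <td class="ln">
        self.cells = []          # text found in each matching cell

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "td" and dict(attrs).get("class") == "ln":
            self.in_ln_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_ln_cell = False

    def handle_data(self, data):
        if self.in_ln_cell:
            self.cells[-1] += data


def collect_ln_cells(html_doc):
    """Return the text of all <td class="ln"> cells in html_doc."""
    parser = LineNumberCollector()
    parser.feed(html_doc)
    parser.close()
    return parser.cells


sample = '<td class="ln">15</td><td class="sf3b2"><code>&nbsp;</code></td>'
print(collect_ln_cells(sample))  # prints ['15']
```

A cell that matches but holds no text comes back as an empty string, so rows without a number are easy to filter out afterwards.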

You could try something like this:

from bs4 import BeautifulSoup  # BeautifulSoup 4; the old BeautifulSoup 3 package does not run on Python 3

def getPrintUnicode(soup):
    """Recursively collect the text of a node and all of its descendants."""

    body = ''
    if isinstance(soup, str):
        # Text node. bs4 has already decoded entities such as &quot; and &lt;
        # &nbsp; arrives as the character U+00A0, so map it to a plain space.
        body = body + soup.replace('\xa0', ' ')
    else:
        # Tag (or the soup object itself): concatenate the text of every child.
        if not soup.contents:
            return ''
        for con in soup.contents:
            body = body + getPrintUnicode(con)
    return body

print(getPrintUnicode(BeautifulSoup('<td class="ln">15</td><td class="sf3b2"><code>&nbsp;</code></td>', 'html.parser')))

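You can use this getPrintUnicode() function on the whole page; it returns the full text content. To get the numbers, convert the string to an integer and use exception handling (int() raises ValueError when the text is not a number). For example:

print(int(getPrintUnicode(BeautifulSoup('<td class="ln">15</td><td class="sf3b2"><code>&nbsp;</code></td>'))))

Is the HTML around the numbers always the same?

Yes, the code around the numbers is always the same. However, some rows in between have no number and can be ignored. All rows that contain a number follow the given format.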
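
To make the "use exceptions and convert to an integer" advice concrete, here is a minimal sketch of that step on its own, including writing the result to a text file as the question asks. It assumes the cell texts have already been extracted by whichever parser you chose. parse_numbers, write_numbers, the sample list and the numbers.txt file name are all hypothetical, chosen only for this illustration:

```python
import os
import tempfile


def parse_numbers(cell_texts):
    """Return the integers found in cell_texts, skipping non-numeric entries."""
    numbers = []
    for text in cell_texts:
        try:
            numbers.append(int(text.strip()))
        except ValueError:
            # Row without a number (empty or non-numeric text): ignore it.
            continue
    return numbers


def write_numbers(numbers, path):
    """Write one number per line to a text file."""
    with open(path, "w") as out:
        for number in numbers:
            out.write("{0}\n".format(number))


# Hypothetical extracted cell texts: two numbers, an empty cell, a lone &nbsp;
cell_texts = ["15", "", "\xa0", "16 "]
numbers = parse_numbers(cell_texts)

# Hypothetical output location; use whatever path suits your project.
output_path = os.path.join(tempfile.gettempdir(), "numbers.txt")
write_numbers(numbers, output_path)
print(numbers)  # prints [15, 16]
```

Catching only ValueError keeps other problems (for example a None slipping into the list) visible instead of silently swallowing them.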