Removing specific HTML tags with Python

python html

I have some HTML tables inside an HTML table cell, like so:

miniTable = '''<table style="width: 100%%" bgcolor="%s">
               <tr><td><font color="%s"><b>%s</b></td></tr>
           </table>''' % (bgcolor, fontColor, floatNumber)

html += '<td>' + miniTable + '</td>'

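where floatNumber is the string representation of a floating-point number. I don't want any other HTML tags to be modified in any way. I've considered string.replace or a regex, but I'm stumped.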

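Use an HTML parsing library such as Beautiful Soup to get the elements you need and the text they contain.

The final code should look something like this: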

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")  # html_doc is the HTML string to parse

for t in soup.find_all("table"): # the actual selection depends on your specific code
    content = t.get_text()
    # content should be the float number
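
If a third-party parser isn't an option, the same idea (parse the markup, then read the text inside each table) can be sketched with the standard library's html.parser module. This is only a rough sketch under a few assumptions, not an equivalent of the Beautiful Soup snippet: the TableTextExtractor class is made up for illustration, and the colour and number values are hypothetical stand-ins for bgcolor, fontColor and floatNumber.

```python
from html.parser import HTMLParser


class TableTextExtractor(HTMLParser):
    """Collect the text found inside each outermost <table> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <table> elements currently open
        self.chunks = []  # text pieces of the table being read
        self.texts = []   # one stripped string per outermost table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost table just closed: store its text
                self.texts.append("".join(self.chunks).strip())
                self.chunks = []

    def handle_data(self, data):
        # Only keep text that sits inside a table
        if self.depth > 0:
            self.chunks.append(data)


# Hypothetical sample values standing in for bgcolor, fontColor and floatNumber
mini_table = '''<table style="width: 100%%" bgcolor="%s">
    <tr><td><font color="%s"><b>%s</b></td></tr>
</table>''' % ("#ffffff", "#000000", "1.23")

parser = TableTextExtractor()
parser.feed("<td>" + mini_table + "</td>")
parser.close()
values = [float(text) for text in parser.texts]  # [1.23]
```

Unlike an XML parser, html.parser does not complain about the unclosed font tag in the miniTable markup, which is why it is used here.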

If you can't install and use Beautiful Soup (otherwise BS is preferred, as @otto allmendinger suggested):

import re

# Sample input: the miniTable markup with 1.23 as the float value
s = '<table style="width: 100%%" bgcolor="%s"><tr><td><font color="%s"><b>1.23</b></td></tr></table>'
# Remove only the table, tr, td, font and b tags
result = float(re.sub(r"</?table[^>]*>|</?t[rd]>|<font[^>]+>|</?b>", "", s))

Thanks for the quick reply! I'm using a proprietary development environment, so I can't install and use Beautiful Soup.

If the HTML is well-formed, you could also try Python's built-in XML parser.

Interesting, though. Not using regular expressions? OK. Also, if you only want to remove a predefined number of HTML tags and don't need to parse attributes, build a tree, and so on, why not use a lightweight regex?

@fedosov The code you linked is for parsing selectors, not for parsing XHTML.

This works great for my application! Otto's solution would also be great if I could use Beautiful Soup.
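
The regex answer above works on a single miniTable string. If it were run over the whole html string instead, it would also strip the surrounding td tags, which the question wants left alone. One possible way around that, sketched here under the assumption that every miniTable is generated exactly as shown in the question, is to match the whole miniTable and substitute only the captured number. The names MINI_TABLE and unwrap_mini_tables and the sample values are hypothetical.

```python
import re

# Pattern for one miniTable as generated in the question; whitespace between
# tags is optional and the attribute values are allowed to vary.
MINI_TABLE = re.compile(
    r'<table style="width: 100%" bgcolor="[^"]*">\s*'
    r'<tr><td><font color="[^"]*"><b>([^<]*)</b></td></tr>\s*'
    r'</table>'
)


def unwrap_mini_tables(html):
    """Replace every miniTable in html with the bare text it wraps."""
    return MINI_TABLE.sub(r"\1", html)


# Hypothetical sample row: one plain cell and one cell holding a miniTable
mini_table = '''<table style="width: 100%%" bgcolor="%s">
               <tr><td><font color="%s"><b>%s</b></td></tr>
           </table>''' % ("#ffffff", "#000000", "1.23")
html = "<tr><td><b>label</b></td><td>" + mini_table + "</td></tr>"

cleaned = unwrap_mini_tables(html)
# cleaned == "<tr><td><b>label</b></td><td>1.23</td></tr>"
```

Because the pattern is anchored on the full miniTable markup, the outer td tags and the unrelated b tag in the first cell survive. The trade-off is that any change to the generated markup means updating the pattern.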