How to remove [font] tags using html2text/BeautifulSoup in Python

python regex

I am using BeautifulSoup and get this result from my website, a snippet with a lot of markup:

<span style="color: blue;"><span style="color: blue;">[font='Times New Roman']<span style="font-size: 22pt;">THIS</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> IS </span>[/font]<span style="color: #FF3300;"><span style="color: #FF3300;">[font='Times New Roman']<span style="font-size: 22pt;">A TEST</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> USING </span>[/font]<span style="color: #00CC66;"><span style="color: #00CC66;">[font='Times New Roman']<span style="font-size: 22pt;">SOME</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> BEAUTIFUL </span>[/font]<span style="color: fuchsia;"><span style="color: fuchsia;">[font='Times New Roman']<span style="font-size: 22pt;">SOUP</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> | </span>[/font]<span style="color: #00CCFF;"><span style="color: #00CCFF;">[font='Times New Roman']<span style="font-size: 22pt;">96786</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> AND </span>[/font]<span style="color: #CC33FF;"><span style="color: #CC33FF;">[font='Times New Roman']<span style="font-size: 22pt;">HTML2TEXT</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> TO LEARN </span>[/font]<span style="color: red;"><span style="color: red;">[font='Times New Roman']<span style="font-size: 22pt;">NEW THING</span>[/font]</span></span>

The best result I have obtained so far is:

[font='Times New Roman']THIS[/font][font='Times New Roman'] THIS
[/font][font='Times New Roman']IS[/font][font='Times New
Roman'] A TEST [/font][font='Times New Roman']USING[/font][font='Times New
Roman'] BEAUTIFUL [/font][font='Times New Roman'] SOUP [/font][font='Times New Roman']
| [/font][font='Times New Roman']96786[/font][font='Times New Roman'] AND [/font][font='Times New Roman'] HTML2TEXT [/font][font='Times New Roman'] TO LEARN [/font][font='Times New Roman']NEW THING[/font]

How can I strip the [font] tags using html2text + BeautifulSoup, or any other method? Thanks, everyone.

My approach is to use string replacement to replace [font…] and [/font] with "", but this seems inefficient. Is there another way to solve this?
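
For reference, the replacement approach described above can be done in a single regex pass rather than several string replacements. This is only a minimal sketch: the names FONT_TAG and strip_font_tags are made up for illustration, and it assumes a font tag never contains a literal "]" inside its argument.

```python
import re

# Matches an opening [font ...] tag or a closing [/font] tag.
# [^\]]* also matches newlines, so a tag that was wrapped across
# two lines (as in the output above) is still removed.
FONT_TAG = re.compile(r"\[/?font\b[^\]]*\]", re.IGNORECASE)


def strip_font_tags(text):
    """Remove BBCode font markers and keep the text between them."""
    return FONT_TAG.sub("", text)


sample = "[font='Times New Roman']THIS[/font][font='Times New\nRoman'] IS [/font]"
print(repr(strip_font_tags(sample)))  # 'THIS IS '
```

Compiling the pattern once and calling sub() on each page keeps the cost to one scan of the string per page.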

It looks like your input is a mix of HTML and BBCode. BeautifulSoup and html2text are both meant for parsing and extracting text from HTML, not BBCode.

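A simple solution is to convert the [font] BBCode "tags" to HTML before processing with BeautifulSoup or html2text. You can do the conversion with regular expressions; see convert_bbcode_fonts below. (Note that this does not actually convert your input into "valid" HTML4 font tags, but html2text will still handle the input.)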
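
To make the note about "valid" HTML concrete, here is a small sketch that applies the two substitutions to a single tag pair using only the standard re module. The variable names are illustrative. The result carries a bare ='Times New Roman' where an attribute name would normally be, which is why it is not a well-formed font tag.

```python
import re

sample = "[font='Times New Roman']THIS[/font]"

# Opening tag: everything after "font" up to "]" is copied into <font ...>.
# The replacement must be a raw string so that \1 is a backreference
# to the captured group, not the control character chr(1).
step1 = re.sub(r"\[font\s*([^\]]+)\]", r"<font \1>", sample, flags=re.IGNORECASE)

# Closing tag: [/font] becomes </font>.
step2 = re.sub(r"\[/font\s*\]", "</font>", step1, flags=re.IGNORECASE)

print(step2)  # <font ='Times New Roman'>THIS</font>
```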


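Is that a tag delimiter in HTML? I doubt it.
Thanks. That is BBCode; it is basically a mix of BBCode and HTML.
Thanks for your answer, this is another interesting way to solve the problem.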

import re
import html2text


def convert_bbcode_fonts(html):
    flags = re.IGNORECASE | re.MULTILINE
    # replace start font tags; the replacement is a raw string so that
    # \1 is a backreference to the captured attribute text
    html = re.sub(r'\[font\s*([^\]]+)\]', r'<font \1>', html, flags=flags)
    # replace end font tags
    html = re.sub(r'\[/font\s*\]', '</font>', html, flags=flags)
    return html

def extract_text(html):
    html = convert_bbcode_fonts(html)
    h = html2text.HTML2Text()
    h.ignore_links = True
    h.ignore_images = True
    h.ignore_emphasis = True
    return h.handle(html)

INPUT = """
<span style="color: blue;"><span style="color: blue;">[font='Times New Roman']<span style="font-size: 22pt;">THIS</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> IS </span>[/font]<span style="color: #FF3300;"><span style="color: #FF3300;">[font='Times New Roman']<span style="font-size: 22pt;">A TEST</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> USING </span>[/font]<span style="color: #00CC66;"><span style="color: #00CC66;">[font='Times New Roman']<span style="font-size: 22pt;">SOME</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> BEAUTIFUL </span>[/font]<span style="color: fuchsia;"><span style="color: fuchsia;">[font='Times New Roman']<span style="font-size: 22pt;">SOUP</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> | </span>[/font]<span style="color: #00CCFF;"><span style="color: #00CCFF;">[font='Times New Roman']<span style="font-size: 22pt;">96786</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> AND </span>[/font]<span style="color: #CC33FF;"><span style="color: #CC33FF;">[font='Times New Roman']<span style="font-size: 22pt;">HTML2TEXT</span>[/font]</span></span>[font='Times New Roman']<span style="font-size: 22pt;"> TO LEARN </span>[/font]<span style="color: red;"><span style="color: red;">[font='Times New Roman']<span style="font-size: 22pt;">NEW THING</span>[/font]</span></span>
"""

if __name__ == '__main__':
    print(extract_text(INPUT))
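
If installing html2text is not an option, roughly the same idea can be sketched with only the standard library: drop the BBCode font markers first, then let html.parser discard the HTML tags. This is a stand-in under stated assumptions, not a replacement for html2text. TextCollector and plain_text are hypothetical names, and the whitespace collapsing at the end is a choice made here for readability.

```python
import re
from html.parser import HTMLParser

# Opening [font ...] or closing [/font]; assumes no "]" inside the tag.
FONT_TAG = re.compile(r"\[/?font\b[^\]]*\]", re.IGNORECASE)


class TextCollector(HTMLParser):
    """Collects text nodes only, so every HTML tag is dropped."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def plain_text(markup):
    parser = TextCollector()
    parser.feed(FONT_TAG.sub("", markup))  # strip BBCode, then parse HTML
    parser.close()                         # flush any buffered text
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(parser.chunks).split())


sample = (
    '<span style="color: blue;">[font=\'Times New Roman\']'
    '<span style="font-size: 22pt;">THIS</span>[/font]</span>'
    '[font=\'Times New Roman\']<span style="font-size: 22pt;"> IS </span>[/font]'
)
print(plain_text(sample))  # THIS IS
```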