Python: Using BeautifulSoup to match a string in an HTML document and highlight it wherever it occurs


python python-3.x


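I'm trying to match a string in an HTML document and highlight just that string. I'm using BeautifulSoup with html.parser.

So far I have tried find_all() with the string I want to match, but that doesn't help, because it returns the entire text of the element.

I'd appreciate guidance on how to target a specific string in the document and highlight it.

For example, given this markup: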

 <p>Lorem  is simply dummy text of the printing and typesetting industry.</p> 
 <p>Lorem has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it

Desired output:

 <p><mark>Lorem</mark> is simply dummy text of the printing and typesetting industry.</p> 
 <p><mark>Lorem</mark> has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it
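
The find_all() behaviour described in the question can be reproduced with a short sketch. It assumes the search was done with a string filter (`string=` with a compiled pattern); the second paragraph of the sample markup is made up for illustration. The filter selects whole text nodes that contain the pattern, not the matched substring, which is why the entire text comes back.

```python
import re
from bs4 import BeautifulSoup

# Sample markup: the first paragraph is from the question,
# the second one is a hypothetical non-matching paragraph.
html = (
    "<p>Lorem is simply dummy text of the printing and typesetting industry.</p>"
    "<p>No match in this paragraph.</p>"
)
soup = BeautifulSoup(html, "html.parser")

# A string filter returns every text node in which the pattern is found.
# Each result is the complete text node, not just the matching part.
matches = soup.find_all(string=re.compile("Lorem"))

for node in matches:
    print(repr(str(node)))
```

So find_all() is useful for locating the text nodes that need editing, but the substring itself still has to be split out and wrapped separately.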

Maybe you can try something like this:

from bs4 import BeautifulSoup as bs

# 'lxml' requires the lxml package; 'html.parser' also works here
soup = bs("<p>Lorem  is simply dummy text of the printing and typesetting industry.</p> ", 'lxml')

# This is the word we want to put a tag around
special_word = 'Lorem'
content_orig = soup.p.text
split_content_orig = content_orig.split(special_word)

# Empty the paragraph, then rebuild it piece by piece
soup.p.clear()
soup.p.append(split_content_orig[0])

for i_word in split_content_orig[1:]:
    # Create a new tag in every iteration: a tag object can only sit in one
    # place in the tree, so inserting the same one again would just move it
    new_tag = soup.new_tag('mark')
    new_tag.string = special_word
    soup.p.append(new_tag)
    soup.p.append(i_word)
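
The snippet above only touches the first <p>. A minimal sketch of the same split-and-rebuild idea applied to every paragraph could look like the following. The function name `mark_word_in_paragraphs` and the shortened sample text are assumptions made for illustration, and html.parser is used so that no extra package is needed. Like the original, it flattens any tags nested inside a paragraph, because it works on the paragraph's plain text.

```python
from bs4 import BeautifulSoup


def mark_word_in_paragraphs(html, word):
    """Wrap every occurrence of `word` inside each <p> in a <mark> tag.

    Hypothetical helper: nested markup inside a matching <p> is lost,
    because the paragraph is rebuilt from its plain text.
    """
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p"):
        parts = p.get_text().split(word)
        if len(parts) == 1:
            continue  # word not present, leave this paragraph untouched
        p.clear()
        if parts[0]:
            p.append(parts[0])
        for rest in parts[1:]:
            # A fresh tag per occurrence; reusing one would move it around
            mark = soup.new_tag("mark")
            mark.string = word
            p.append(mark)
            if rest:
                p.append(rest)
    return str(soup)


# Shortened sample text, loosely based on the question's markup
html = "<p>Lorem is dummy text.</p><p>Lorem has been the standard. Lorem again.</p>"
result = mark_word_in_paragraphs(html, "Lorem")
print(result)
```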


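I had a similar problem and posted my own question about it here:

Maybe someone else will respond there with a better solution. In the meantime, if you are working with more complex HTML, I think you can use the approach below. You probably don't want to replace the text everywhere in the HTML, because that can break links, images, styles and so on.

You can replace only the occurrences inside text nodes like this: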


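import re
from bs4 import BeautifulSoup

def highlight_html(html, re_highlighter):
    soup = BeautifulSoup(html, 'html.parser')
    # Copy the text nodes into a list first, because the loop modifies the tree
    for text_node in list(soup.strings):
        # \g<0> inserts the whole match, so the pattern needs no capture group
        highlighted = re_highlighter.sub(r"<mark>\g<0></mark>", text_node)
        if highlighted != text_node:
            highlighted_soup = BeautifulSoup(highlighted, 'html.parser')
            text_node.replace_with(highlighted_soup)
    return str(soup)

# create your re rule as needed...
re_highlighter = re.compile(r"Lorem...", flags=re.IGNORECASE)
html = "<p>Lorem is simply dummy text of the printing and typesetting industry.</p>"
highlighted_html = highlight_html(html, re_highlighter)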
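
One thing to watch with the function above: the substituted string is parsed as HTML again, so a literal < or & in the text is re-interpreted as markup. As a hedged alternative, the <mark> tags can be built with new_tag() and inserted directly, without reparsing. The sketch below is not the answerer's method. The name `highlight_text_nodes`, the sample markup and the pattern are assumptions, and the pattern is assumed never to match an empty string.

```python
import re
from bs4 import BeautifulSoup


def highlight_text_nodes(html, pattern):
    """Wrap each regex match inside text nodes in <mark>, leaving tags
    and attributes alone. Hypothetical helper for illustration."""
    soup = BeautifulSoup(html, "html.parser")
    # Snapshot the matching text nodes first, because the loop edits the tree
    for node in list(soup.find_all(string=pattern)):
        text = str(node)
        last = 0
        for m in pattern.finditer(text):
            if m.start() > last:
                # Plain text between the previous match and this one
                node.insert_before(text[last:m.start()])
            mark = soup.new_tag("mark")
            mark.string = m.group(0)
            node.insert_before(mark)
            last = m.end()
        if last < len(text):
            node.insert_before(text[last:])
        # The original text node has been replaced piece by piece
        node.extract()
    return str(soup)


# Assumed sample: the href contains "lorem" but must stay untouched
html = '<p>Lorem <a href="/lorem">about Lorem</a> &amp; more</p>'
pattern = re.compile(r"lorem", flags=re.IGNORECASE)
result = highlight_text_nodes(html, pattern)
print(result)
```

Only text nodes are edited here, so the link target keeps its original value while both visible occurrences get wrapped.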



Elaborate on what you mean by "highlight".
@RomanPerekhrest I edited the question.

