Python: Replace multiple strings with elements in an HTML document

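I have multiple strings that I want to wrap in HTML tags inside an HTML document. I want to leave the text unchanged, but replace each string with an HTML element that contains that string.

In addition, some of the strings I want to replace contain other strings I want to replace. In those cases, I want the replacement for the longer string to be applied and the replacement for the shorter string to be ignored.

Finally, I only want to perform a replacement when the string is entirely contained within a single element.

Here is my replacement list: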

replacement_list = [
    ('foo', '<span title="foo" class="customclass34">foo</span>'),
    ('foo bar', '<span id="id21" class="customclass79">foo bar</span>')
]

Given the following HTML:

<html>
<body>
<p>Paragraph contains foo</p>
<p>Paragraph contains foo bar</p>
</body>
</html>


I want to turn it into the following:

<html>
<body>
<p>Paragraph contains <span title="foo" class="customclass34">foo</span></p>
<p>Paragraph contains <span id="id21" class="customclass79">foo bar</span></p>
</body>
</html>


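When working with small files, it is simplest to read the file line by line, replace what needs replacing in each line, and then write everything to a new file.
Assuming the input file is named test.html and the result is written to output.html: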
replacement_list = {
    'foo': '<span title="foo" class="customclass34">foo</span>',
    'foo bar': '<span id="id21" class="customclass79">foo bar</span>',
}

with open('output.html', 'w') as dest:
    with open('test.html', 'r') as src:
        for line in src:  # read the source file line by line
            # collect every search string that occurs in this line
            str_possible = [string for string in replacement_list if string in line]
            if str_possible:
                # keep only the longest candidate, so 'foo bar' wins over 'foo'
                str_final = max(str_possible, key=len)
                line = line.replace(str_final, replacement_list[str_final])

            dest.write(line)

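I also suggest reading up on Python dictionaries, which is the type of object I used for replacement_list.
Finally, this code works as long as each line contains at most one of the strings. If a line contains two, you will need to adjust it a little, but this should give you the general idea.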
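The one-string-per-line limitation described above can be lifted by combining all search strings into a single regular expression, listing longer strings first so that 'foo bar' takes precedence over 'foo'. This is only a sketch: the helper name wrap_all is made up for illustration, and like the loop above it treats each line as plain text, so it would also touch matches inside tags or attribute values.

```python
import re

replacement_list = {
    'foo': '<span title="foo" class="customclass34">foo</span>',
    'foo bar': '<span id="id21" class="customclass79">foo bar</span>',
}


def wrap_all(line, replacements):
    """Replace every occurrence of every key, preferring longer keys.

    Hypothetical helper, not part of the original answer.
    """
    # Alternation is tried left to right, so put the longest keys first.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    # re.sub never rescans text it has already replaced, so the 'foo'
    # inside an inserted <span> is not wrapped a second time.
    return pattern.sub(lambda m: replacements[m.group(0)], line)


print(wrap_all('<p>foo and foo bar</p>', replacement_list))
```

Because the whole line is handled in one call to sub, several different strings on the same line are each replaced once.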

I think this is very close to what you are looking for. You can use soup.find_all(string=True) to get only the NavigableString elements and then do the replacement.
from bs4 import BeautifulSoup

html = """
<html>
<body>
<p>Paragraph contains foo</p>
<p>Paragraph contains foo bar</p>
</body>
</html>
"""
replacement_list = [
    ('foo', '<span title="foo" class="customclass34">foo</span>'),
    ('foo bar', '<span id="id21" class="customclass79">foo bar</span>')
]
soup = BeautifulSoup(html, 'html.parser')
for s in soup.find_all(string=True):
    # walk the list backwards, assuming it is in ascending order of length
    for key, val in replacement_list[::-1]:
        if key in s:
            new_s = s.replace(key, val)
            # restrict yourself to this built-in parser
            s.replace_with(BeautifulSoup(new_s, 'html.parser'))
            break  # stop at the first (longest) match
print(soup)

# optionally re-parse so that each new span is treated as a separate tag
soup = BeautifulSoup(str(soup), 'html.parser')
print(soup.find_all('span'))

Output:
<html>
<body>
<p>Paragraph contains <span class="customclass34" title="foo">foo</span></p>
<p>Paragraph contains <span class="customclass79" id="id21">foo bar</span></p>
</body>
</html>

[<span class="customclass34" title="foo">foo</span>, <span class="customclass79" id="id21">foo bar</span>]


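I found a solution.
I have to traverse the HTML once for each distinct string that I want to wrap in HTML tags. This seems inefficient, but I could not find a better way.
I added a class to every tag I insert. It is used to check whether the string I am trying to replace is part of a longer string that has already been replaced.
This solution is also case-insensitive (it will wrap the tags around the string "fOo") while preserving the case of the original text.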
import re

from bs4 import BeautifulSoup


def html_update(input_html):
    # Name the parser explicitly so the result does not depend on which
    # parsers happen to be installed
    soup = BeautifulSoup(input_html, 'html.parser')

    replacement_list = [
        ('foo', '<span title="foo" class="customclass34 replace">', '</span>'),
        ('foo bar', '<span id="id21" class="customclass79 replace">', '</span>')
    ]
    # Go through the list in order of decreasing length
    replacement_list = sorted(replacement_list, key=lambda k: -len(k[0]))

    for item in replacement_list:
        # re.escape keeps any regex metacharacters in the search string literal
        replace_regex = re.compile(re.escape(item[0]), re.IGNORECASE)
        target = soup.find_all(string=replace_regex)
        for v in target:
            # You can use other conditions here, like (v.parent.name == 'a')
            # to not wrap the tags around strings within links
            if v.parent.has_attr('class') and 'replace' in v.parent['class']:
                # The match must be part of a larger string that was already replaced, so do nothing
                continue

            def replace(match):
                # Wrap the matched text itself, which preserves its original case
                return '{0}{1}{2}'.format(item[1], match.group(0), item[2])

            new_v = replace_regex.sub(replace, v)
            v.replace_with(BeautifulSoup(new_v, 'html.parser'))
    return str(soup)
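
For the text of a single node, the case-insensitive wrapping can also be done in one pass over all strings instead of one pass per string. The following is a rough sketch under that assumption; wrap_ignore_case is a hypothetical name, and it works on a plain string rather than on a parsed document.

```python
import re

replacement_list = [
    ('foo', '<span title="foo" class="customclass34 replace">', '</span>'),
    ('foo bar', '<span id="id21" class="customclass79 replace">', '</span>'),
]


def wrap_ignore_case(text, replacements):
    """Wrap every match, longest string first, keeping the original case.

    Hypothetical helper for illustration only.
    """
    ordered = sorted(replacements, key=lambda item: len(item[0]), reverse=True)
    # One capturing group per search string; the group that matched tells
    # us which pair of tags to use.
    pattern = re.compile(
        '|'.join('(%s)' % re.escape(key) for key, _, _ in ordered),
        re.IGNORECASE,
    )

    def repl(match):
        _, open_tag, close_tag = ordered[match.lastindex - 1]
        # match.group(0) is the text as it appeared, so 'fOo' stays 'fOo'.
        return open_tag + match.group(0) + close_tag

    return pattern.sub(repl, text)


print(wrap_ignore_case('fOo and Foo Bar', replacement_list))
```

Since already-replaced text is never rescanned within a single sub call, this variant does not need the marker class to avoid wrapping 'foo' inside an existing 'foo bar' span.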

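This code will replace "foo" on the second line and leave "bar" unchanged, so "foo bar" is never replaced at all.
Oh, right, I went a bit too fast. I have just updated my code based on your comment.
The problem with this solution is that if the string appears inside an HTML tag, it will be replaced there as well, resulting in malformed HTML.
I am not sure I understand, since I do not know much about HTML, but do you mean that if you have … and … it would replace it? In that case, you just need to check that there is no … in the line.
For modifying HTML, it is usually best to use an HTML library rather than simple string replacement. There are too many edge cases to handle reliably yourself.
In this case, using Beautiful Soup to extract the text (stripping out the HTML tags) and then doing the replacement might work, so that is what I am trying now. I do appreciate the answer, though.
I upvoted this as a close solution that showed me the tools that can be used. The HTML I am working with may have multiple matches of the strings to replace within a single NavigableString object, so breaking on the first match to prevent duplicates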
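
The concern raised in these comments, that plain string replacement also hits matches inside tags and attribute values, can be illustrated with the standard library's html.parser module. This is a minimal sketch, not the approach of any answer above: the class TextWrapper and the function wrap_text_nodes are hypothetical names, and the sketch does not special-case script or style content.

```python
import re
from html.parser import HTMLParser

REPLACEMENTS = {
    'foo': '<span title="foo" class="customclass34">foo</span>',
    'foo bar': '<span id="id21" class="customclass79">foo bar</span>',
}


class TextWrapper(HTMLParser):
    """Hypothetical helper: re-emits the markup, rewriting text nodes only."""

    def __init__(self, replacements):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        keys = sorted(replacements, key=len, reverse=True)  # longest first
        self.pattern = re.compile('|'.join(re.escape(k) for k in keys))
        self.replacements = replacements
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Echo the tag exactly as it appeared; attributes are never searched.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append('</%s>' % tag)

    def handle_data(self, data):
        # Only text between tags is searched, so a match can never span
        # two elements.
        self.out.append(
            self.pattern.sub(lambda m: self.replacements[m.group(0)], data))

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

    def handle_comment(self, data):
        self.out.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.out.append('<!%s>' % decl)


def wrap_text_nodes(html, replacements=REPLACEMENTS):
    parser = TextWrapper(replacements)
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)


print(wrap_text_nodes('<p class="foo">Paragraph contains foo bar</p>'))
```

In this example the class="foo" attribute is left alone while the text "foo bar" is wrapped. For real documents, a full HTML library such as Beautiful Soup, as recommended in the comments, is likely the safer choice.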