Python: how do I remove all substrings that occur between two markers in a string?


I am trying to remove every substring that occurs between <pre><code> and </code></pre> in the following string, and to remove the two markers themselves at the same time:

txt = '<p>Large pythons were <pre><code> the subject of many </code></pre> a news story </p>\n last year due to the fact that there were at least two deaths <pre><code> directly attributable to them </code></pre>. Tragically, the deaths would not have happened had <pre><code> the owners of these snakes kept them </code></pre> safely, and responsibly, contained. The following article, by David Chiszar, Hobart M. Smith, <a href= Albert Petkus and Joseph Dougherty </a>, was recently published in the Bulletin of the Chicago Herpetological Society, and represents the first clear, and accurate, <p> account of the death that occurred July 1993</p>\n'
def remsubstr(s, first, last):
    # note: "if first and last not in s" would only test the second marker
    # (operator precedence), so check both markers explicitly
    if first not in s or last not in s:
        return s

    try:
        start = s.index(first) + len(first)
        end = s.index(last, start)
        # blank out the first marked span, then drop the now-empty marker pair
        d = (s[:start] + " " + s[end:]).replace('<p>', '').replace('</p>\n', '')
        started = d.index("<pre><code>")
        ended = d.index("</code></pre>") + len("</code></pre>")
        nw = d.replace(d[started:ended], '')

        if first in nw and last in nw:
            start = nw.index(first) + len(first)
            end = nw.index(last, start)
            d1 = nw[:start] + " " + nw[end:]
            started = d1.index("<pre><code>")
            ended = d1.index("</code></pre>") + len("</code></pre>")
            nw1 = d1.replace(d1[started:ended], '')

            if first in nw1 and last in nw1:
                start = nw1.index(first) + len(first)
                end = nw1.index(last, start)
                d2 = nw1[:start] + " " + nw1[end:]
                started = d2.index("<pre><code>")
                ended = d2.index("</code></pre>") + len("</code></pre>")
                nw2 = d2.replace(d2[started:ended], '')
                return nw2

            return nw1

        return nw

    except ValueError:
        return ""

remsubstr(txt, "<pre><code>", "</code></pre>")
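The three nested copies of the same block above can be collapsed into a loop. A sketch of that generalization (my own rewrite, not from the original post), which removes every marker pair and the text between them, however many occurrences there are:

```python
def remove_between(s, first, last):
    """Drop every span from `first` through `last`, markers included."""
    while True:
        start = s.find(first)
        if start == -1:                    # no opening marker left
            return s
        end = s.find(last, start + len(first))
        if end == -1:                      # opening marker without a close
            return s
        s = s[:start] + s[end + len(last):]

cleaned = remove_between(
    "keep <pre><code>drop</code></pre> this <pre><code>too</code></pre> end",
    "<pre><code>", "</code></pre>")
# cleaned == "keep  this  end"
```

str.find returns -1 instead of raising, so no try/except is needed here.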
This returns the text with the code sections (and the <p> markers) stripped:

 a news story 
 last year due to the fact that there were at least two deaths . Tragically, the deaths would not have happened had  safely, and responsibly, contained. The following article, by David Chiszar, Hobart M. Smith, ...

Standard string operations are not ideal for nested structures like this in XML/HTML. I suggest using BeautifulSoup: there you can combine .find_all() and .decompose(). In your case, this should do it:

import bs4

txt = '<p>Large pythons were <pre><code> the subject of many </code></pre> a news story </p>\n last year due to the fact that there were at least two deaths <pre><code> directly attributable to them </code></pre>. Tragically, the deaths would not have happened had <pre><code> the owners of these snakes kept them </code></pre> safely, and responsibly, contained. The following article, by David Chiszar, Hobart M. Smith, <a href= Albert Petkus and Joseph Dougherty </a>, was recently published in the Bulletin of the Chicago Herpetological Society, and represents the first clear, and accurate, <p> account of the death that occurred July 1993</p>\n'
soup = bs4.BeautifulSoup(txt, "html.parser")
for tag in soup.find_all('pre'):
    if tag.find('code'):   # only target <pre> blocks that contain <code>
        tag.decompose()    # remove the tag and everything inside it

result = str(soup)
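A side note on this answer: the question's own code also strips the <p> and </p> markers while keeping their inner text. In BeautifulSoup that corresponds to unwrap(), which removes a tag but keeps its children. A minimal sketch on a clean snippet (my addition, not part of the answer above):

```python
import bs4

soup = bs4.BeautifulSoup("before <p>kept text</p> after", "html.parser")
for p in soup.find_all("p"):
    p.unwrap()          # drop the <p> tag itself, keep its contents

print(str(soup))        # → "before kept text after"
```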

Never use regular expressions on HTML. Use an HTML parser; Python has several options, and the intended operation here (removing a set of blacklisted elements) is dead simple in an HTML parser, far simpler than any string manipulation. This question has been asked before, too: look around, ignore every suggested solution that uses string replace or regex, and you will be fine.
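To illustrate the "Python has several options" point: BeautifulSoup (shown above) is the usual choice, but a standard-library-only version is possible with html.parser. This is my own sketch, not from the original answers; it re-emits the markup while skipping everything inside blacklisted elements:

```python
from html.parser import HTMLParser

class BlacklistStripper(HTMLParser):
    """Re-emit markup, skipping everything inside blacklisted tags."""

    def __init__(self, blacklist=("pre",)):
        super().__init__(convert_charrefs=True)
        self.blacklist = set(blacklist)
        self.depth = 0      # nesting depth inside a blacklisted element
        self.out = []       # pieces of the cleaned document

    def handle_starttag(self, tag, attrs):
        if tag in self.blacklist or self.depth:
            self.depth += 1                      # entering/inside a skipped region
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.depth and tag not in self.blacklist:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1                      # leaving a skipped region
        else:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.depth:
            self.out.append(data)

stripper = BlacklistStripper()
stripper.feed("<p>keep <pre><code>drop</code></pre> rest</p>")
cleaned = "".join(stripper.out)   # → "<p>keep  rest</p>"
```

Unlike BeautifulSoup this builds no tree and will not repair badly broken markup; it simply skips tokens as they stream past.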