Python BeautifulSoup: how to remove tags whose text has a specific value


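I'm trying to scrape some articles from Wikipedia and found some entries I want to exclude.

In the example below, I want to exclude the two <a> tags whose content equals Archived or Wayback Machine. The text doesn't have to be the deciding factor: I found that the href values could also be used for the exclusion, since they contain the URL archive.org or /wiki/Wayback_Machine.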
<li id="cite_note-22">
    <span class="mw-cite-backlink">
        <b>
            <a href="#cite_ref-22" aria-label="Jump up" title="Jump up">^</a>
        </b>
    </span> 
    <span class="reference-text">
        <a rel="nofollow" class="external text" href="https://www.somelink.com">Article Text I want to keep</a> 
        <a rel="nofollow" class="external text" href="https://www.someotherlink.com">Archived</a>
        <a href="/wiki/Wayback_Machine" title="Wayback Machine">Wayback Machine</a>
    </span>
</li>

I also tried excluding them another way, but ran into a similar problem.
Is there a better way to ignore these links?

You could try the following:
import re
from bs4 import BeautifulSoup

html = """<li id="cite_note-22">
    <span class="mw-cite-backlink">
        <b>
            <a href="#cite_ref-22" aria-label="Jump up" title="Jump up">^</a>
        </b>
    </span> 
    <span class="reference-text">
        <a rel="nofollow" class="external text" href="https://www.somelink.com">Article Text I want to keep</a> 
        <a rel="nofollow" class="external text" href="https://www.someotherlink.com">Archived</a>
        <a href="/wiki/Wayback_Machine" title="Wayback Machine">Wayback Machine</a>
    </span>
</li>"""

soup = BeautifulSoup(html, "html.parser")
for anchor in soup.find_all(lambda t: t.name == 'a' and not re.search(r'Wayback|Archived|\^', t.text)):
    print(f"{anchor.text} - {anchor.get('href')}")
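
The loop above only skips the unwanted anchors while printing. If the goal is to actually remove them from the parsed tree, one possible sketch (not part of the original answer) is to call decompose() on each matching tag. The shortened html string and the pattern name UNWANTED are assumptions made for illustration.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (assumption: only the reference span matters here).
html = """<span class="reference-text">
    <a rel="nofollow" class="external text" href="https://www.somelink.com">Article Text I want to keep</a>
    <a rel="nofollow" class="external text" href="https://www.someotherlink.com">Archived</a>
    <a href="/wiki/Wayback_Machine" title="Wayback Machine">Wayback Machine</a>
</span>"""

# Hypothetical pattern name; matches anchors whose whole text is one of the two labels.
UNWANTED = re.compile(r"^(Archived|Wayback Machine)$")

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like result, so removing tags while looping over it is safe.
for anchor in soup.find_all("a"):
    if UNWANTED.search(anchor.get_text(strip=True)):
        anchor.decompose()  # drops the tag and its contents from the tree

remaining = [a.get_text(strip=True) for a in soup.find_all("a")]
print(remaining)
```

After the loop, soup no longer contains the two excluded links, so any later find_all() or get_text() call works on the cleaned tree.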

Edit to answer the comment:
You can match by class using the attrs= argument of .find_all(), and put the regex condition on the text inside the loop.
soup = BeautifulSoup(html, "html.parser")
for anchor in soup.find_all("a", attrs={"class": "external text"}):
    if not re.search(r'Wayback|Archived', anchor.text):
        print(f"{anchor.text} - {anchor.get('href')}")

Output:
Article Text I want to keep - https://www.somelink.com
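
One caveat with attrs={"class": "external text"} is that a string containing a space is compared against the exact class attribute value, so the same classes in a different order would not match. A minimal sketch of an order-independent alternative uses a CSS selector through select(); the reordered markup below is a hypothetical example, not taken from the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the first link lists its classes in the opposite order.
html = """<span class="reference-text">
    <a class="text external" href="https://www.somelink.com">Article Text I want to keep</a>
    <a class="external text" href="https://www.someotherlink.com">Archived</a>
</span>"""

soup = BeautifulSoup(html, "html.parser")

# The exact-string match only finds the anchor whose attribute is literally "external text".
exact = soup.find_all("a", attrs={"class": "external text"})

# The CSS selector matches any <a> carrying both classes, in either order.
both = soup.select("a.external.text")

kept = [a.get_text(strip=True) for a in both
        if not re.search(r"Wayback|Archived", a.get_text())]
print(len(exact), len(both), kept)
```

For the question's own markup both approaches return the same anchors; the selector form only matters when the class order varies between pages.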


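This works great for removing anything with matching text. How would I match on the href? Would it be not re.search(r'archive.org\^', t.href)?

You would use not re.search(r'archive\.org', t.get("href")), but you don't have any hrefs that match archive.org, so it would match everything in your example HTML.

Thank you, these both worked. I'll mark it as fixed. As a side note, how do I get the class? Is it the same as for href, with t.get("class")?

Updated the answer @TheMightyLlama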
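
The href and class points from the comments can be sketched together as follows. This is an illustration under stated assumptions rather than the answerer's code: the archive.org URL in the markup, the pattern name ARCHIVE_HREF, and the helper is_wanted are all hypothetical. Note that get("href") returns None when the attribute is missing, so the sketch falls back to an empty string, and that get("class") returns a list because class is a multi-valued attribute.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: unlike the question's sample, the second link points at archive.org.
html = """<span class="reference-text">
    <a rel="nofollow" class="external text" href="https://www.somelink.com">Article Text I want to keep</a>
    <a rel="nofollow" class="external text" href="https://web.archive.org/web/2020/https://www.somelink.com">Archived</a>
    <a href="/wiki/Wayback_Machine" title="Wayback Machine">Wayback Machine</a>
</span>"""

# Hypothetical pattern name; the dot is escaped so it matches a literal ".".
ARCHIVE_HREF = re.compile(r"archive\.org|/wiki/Wayback_Machine")

def is_wanted(tag):
    # Keep only <a> tags whose href does not look like an archive link.
    # get() returns None for a missing attribute, hence the "" default.
    return tag.name == "a" and not ARCHIVE_HREF.search(tag.get("href", ""))

soup = BeautifulSoup(html, "html.parser")
kept = [(a.get_text(strip=True), a.get("href"), a.get("class"))
        for a in soup.find_all(is_wanted)]
print(kept)
```

Here the class value comes back as ['external', 'text'] rather than the single string "external text", which is worth keeping in mind when comparing it.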