Python: scrape everything between two un-nested tags


Is it possible to scrape everything between two un-nested tags?

For example:

<h3>Title 1</h3>
<div class="div">
    <span class="span">span1</span>
    <label class="label">label1</label>
</div>
<div class="div">
    <span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
    <span class="span">span3</span>
    <label class="label">label2</label>
</div>
<div id="div">
    <span id="span">span4</span>
</div>
Currently I get:

span1
span2
span3
span4
I want to get:

span1
span2

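I don't know whether this is the best solution to the problem, but you can split the HTML text at the second heading and scrape only the part you need.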

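One approach is:

  • Locate the second class="span" element, then navigate backwards with .find_all_previous() to collect every preceding div.

  • The tags come back in reverse document order, so wrap the result in reversed().

  • Within each div, find the span tag and read its text.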

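Beautiful Soup builds a tree structure for you. You can identify the two elements whose context matters to you and iterate from one to the other among their parent's children.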
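    # Split approach: keep only the markup that precedes "Title 2".
    from bs4 import BeautifulSoup

    text = """
    <h3>Title 1</h3>
    <div class="div">
        <span class="span">span1</span>
        <label class="label">label1</label>
    </div>
    <div class="div">
        <span class="span">span2</span>
    </div>
    <h3>Title 2</h3>
    <div class="div">
        <span class="span">span3</span>
        <label class="label">label2</label>
    </div>
    <div id="div">
        <span id="span">span4</span>
    </div>
    """

    # Parse the full document once to look up the text of the second heading.
    soup = BeautifulSoup(text, 'lxml')

    # Cut the raw string at that heading text and keep the part before it.
    sub_text = text.split(soup.find('h3', string="Title 2").string)[0]

    # sub_text is now:
    # '\n<h3>Title 1</h3>\n<div class="div">\n    <span class="span">span1</span>\n    <label class="label">label1</label>\n</div>\n<div class="div">\n    <span class="span">span2</span>\n</div>\n<h3>'

    # Re-parse only the first section and print the span text of each div.
    scrape_me = BeautifulSoup(sub_text, 'lxml')

    for i in scrape_me.find_all("div", class_="div"):
        print(i.span.text)
    # Prints:
    # span1
    # span2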
    
    # Backwards-navigation approach: start at the second div and walk back.
    from bs4 import BeautifulSoup

    html = """
    <h3>Title 1</h3>
    <div class="div">
        <span class="span">span1</span>
        <label class="label">label1</label>
    </div>
    <div class="div">
        <span class="span">span2</span>
    </div>
    <h3>Title 2</h3>
    <div class="div">
        <span class="span">span3</span>
        <label class="label">label2</label>
    </div>
    <div id="div">
        <span id="span">span4</span>
    </div>
    """
    soup = BeautifulSoup(html, "lxml")

    # find_all_previous() returns tags in reverse document order,
    # so reversed() restores the original order.
    for tag in reversed(
        soup.select_one("div:nth-of-type(2) span.span").find_all_previous("div")
    ):
        print(tag.find("span").text)

    # Prints:
    # span1
    # span2
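
The tree-based idea mentioned above (pick a starting element and walk its parent's children until the next heading) has no code in the thread, so here is a minimal sketch of how it might look. It assumes the h3 tags are properly closed and that the headings and div elements are siblings under the same parent. The function name `spans_between` and its parameters are hypothetical, chosen only for illustration, and `html.parser` is used simply to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

html = """
<h3>Title 1</h3>
<div class="div">
    <span class="span">span1</span>
    <label class="label">label1</label>
</div>
<div class="div">
    <span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
    <span class="span">span3</span>
    <label class="label">label2</label>
</div>
<div id="div">
    <span id="span">span4</span>
</div>
"""


def spans_between(markup, start_title, stop_tag="h3"):
    """Collect span texts from the siblings that follow the heading
    named start_title, stopping at the next stop_tag (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    start = soup.find("h3", string=start_title)
    if start is None:
        return []
    found = []
    # With no arguments, find_next_siblings() yields only Tag siblings,
    # in document order (whitespace text nodes are skipped).
    for sibling in start.find_next_siblings():
        if sibling.name == stop_tag:
            break  # reached the next heading; stop collecting
        found.extend(span.get_text() for span in sibling.find_all("span"))
    return found


result = spans_between(html, "Title 1")
print(result)  # ['span1', 'span2']
```

Unlike splitting the raw string, this version never re-parses the markup, and calling it with "Title 2" collects the spans of the second section instead.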