Python 3.x: How to parse nested tags with BeautifulSoup

HTML code

<a href="1.co">1<a href="2.co">2</a></a>

Python code

from bs4 import BeautifulSoup
from bs4 import SoupStrainer


def parse(text):
    soup = BeautifulSoup(text, parse_only=SoupStrainer(['a']), features="html.parser")
    for tag in soup:
        if tag.name == "a" and tag.has_attr("href"):
            print(tag["href"])
        if hasattr(tag, "contents"):
            for text in tag.contents:
                parse(text)

if __name__ == '__main__':
    parse("""<a href="2.co">2<a href="3.co">3</a></a>""")

Just use find_all('a'):

from bs4 import BeautifulSoup
data='''<a href="1.co">1<a href="2.co">2</a></a>'''
soup=BeautifulSoup(data,'html.parser')
for item in soup.find_all('a',href=True):
    print(item['href'])

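Calling str() fixed the problem. The recursive call was handing an already-parsed element from tag.contents back to BeautifulSoup, which expects markup text: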

from bs4 import BeautifulSoup
from bs4 import SoupStrainer

def parse(text):
    soup = BeautifulSoup(text, parse_only=SoupStrainer(['a']), features="html.parser")
    for tag in soup:
        if tag.name == "a" and tag.has_attr("href"):
            print(tag["href"])
        if hasattr(tag, "contents"):
            for text in tag.contents:
                parse(str(text))  # This is where the bug was

if __name__ == '__main__':
    parse("""<a href="2.co">2<a href="3.co">3</a></a>""")
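Re-parsing each child with str() works, but it serializes and parses the same markup again at every level. As a hedged alternative sketch (not part of the original answer), the already-parsed tree can be walked directly. The helper name collect_hrefs is hypothetical, and the example assumes html.parser, which keeps the inner <a> nested inside the outer one.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def collect_hrefs(node, found=None):
    """Walk the parsed tree depth-first and collect href values of <a> tags."""
    if found is None:
        found = []
    for child in node.children:
        # Skip NavigableString children; only Tag objects can carry attributes.
        if isinstance(child, Tag):
            if child.name == "a" and child.has_attr("href"):
                found.append(child["href"])
            # Recurse into the Tag itself instead of re-parsing str(child).
            collect_hrefs(child, found)
    return found


if __name__ == "__main__":
    soup = BeautifulSoup('<a href="2.co">2<a href="3.co">3</a></a>', "html.parser")
    print(collect_hrefs(soup))
```

Because the recursion receives Tag objects rather than strings, BeautifulSoup is constructed only once.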

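If you want all of the <a> tags, then use .find_all('a') as suggested. If you specifically want the nested tags, however, you can keep doing what you are doing now, but within each <a> tag look for children that are themselves <a> tags.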
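A minimal sketch of that idea, not the answerer's own code: for every <a> tag, search inside it for further <a> tags and pair the two href values. The helper name nested_links is hypothetical. The sketch assumes html.parser; other parsers such as lxml or html5lib may repair the invalid nesting so that the tags end up as siblings, in which case nothing would be reported as nested.

```python
from bs4 import BeautifulSoup


def nested_links(markup):
    """Return (outer_href, inner_href) pairs for <a> tags nested inside other <a> tags."""
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for outer in soup.find_all("a", href=True):
        # find_all on a Tag searches only that tag's descendants.
        for inner in outer.find_all("a", href=True):
            pairs.append((outer["href"], inner["href"]))
    return pairs


if __name__ == "__main__":
    print(nested_links('<a href="1.co">1<a href="2.co">2</a></a>'))
```

Passing recursive=False to the inner find_all would restrict the search to direct children instead of all descendants.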

What is your expected output? Could you do me a favor?