Python: parsing bulleted lists in the correct order with BeautifulSoup


I'm trying to parse a website whose structure is very similar to this:

<div class="InternaTesto">
<p class="MarginTop0">Paragraph 1</p><br>
<p>Paragraph 2</p><br>
<p><strong>Paragraph 3</strong></p><br>
<ul>
    <li style="margin: 0px; text-indent: 0px;"><em>List item 1</em></li>
    <li style="margin: 0px; text-indent: 0px;"><em>List item 2</em></li>
    <li style="margin: 0px; text-indent: 0px;"><em>List item 3</em></li>
    ... Some Other Items ...
</ul>
<p><strong>Paragraph 4</strong></p><br>
<ul>
    <li style="margin: 0px; text-indent: 0px;"><em>List item 1</em></li>
    <li style="margin: 0px; text-indent: 0px;"><em>List item 2</em></li>
    <li style="margin: 0px; text-indent: 0px;"><em>List item 3</em></li>
    ... Some Other Items ...
</ul>
... Some Other paragraphs ...
</div>

Is there a way to create sublists, or separate lists, containing all the <li> items?

You can find all the paragraphs and, for each one, take the third of its following siblings:

    from bs4 import BeautifulSoup

    data = """
    Your html here
    """

    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(data, "html.parser")
    for p in soup.find('div', attrs={'class': 'InternaTesto'}).find_all("p"):
        # After each <p> come a <br>, a whitespace text node, and then the next tag (index 2)
        print(p.text, [li.text for li in list(p.next_siblings)[2].find_all('li')])

Prints:

    Paragraph 1 []
    Paragraph 2 []
    Paragraph 3 ['List item 1', 'List item 2', 'List item 3']
    Paragraph 4 ['List item 1', 'List item 2', 'List item 3']
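To see why the third sibling is the one that matters, and why relying on a fixed position is fragile, it can help to list what actually follows a paragraph. The following is a minimal sketch, assuming Python 3, the built-in "html.parser", and a trimmed copy of the markup from the question; `describe` is a hypothetical helper introduced only for this illustration.

```python
from bs4 import BeautifulSoup, Tag

# Trimmed copy of the structure from the question (an assumption for illustration)
html = """<div class="InternaTesto">
<p><strong>Paragraph 3</strong></p><br>
<ul>
    <li><em>List item 1</em></li>
    <li><em>List item 2</em></li>
</ul>
<p><strong>Paragraph 4</strong></p><br>
</div>"""

soup = BeautifulSoup(html, "html.parser")
first_p = soup.find("div", class_="InternaTesto").find("p")


def describe(node):
    """Return the tag name for tags, or the repr of the text for text nodes."""
    return node.name if isinstance(node, Tag) else repr(str(node))


# Everything that follows the first <p> at the same nesting level
siblings = [describe(s) for s in first_p.next_siblings]
print(siblings)
```

The list starts with the `<br>` tag, then the newline text node, and only then the `<ul>`. If the page drops the `<br>` or changes its whitespace, index 2 points at something else, and a paragraph with fewer than three following siblings raises an `IndexError`.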
    

A more reliable approach is to iterate over each paragraph's next siblings until you reach the next paragraph tag:

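    soup = BeautifulSoup(data, "html.parser")
    for p in soup.find('div', attrs={'class': 'InternaTesto'}).find_all("p"):
        print(p.text)
        for sibling in p.next_siblings:
            # Collect the items of any <ul> that belongs to this paragraph
            if sibling.name == 'ul':
                print([li.text for li in sibling.find_all('li')])
            # The next <p> starts a new group, so stop here
            if sibling.name == 'p':
                break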
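If the goal is to keep the sublists rather than print them, the same sibling walk can fill a data structure. This is a sketch under stated assumptions: Python 3, the built-in "html.parser", and a shortened version of the question's markup. The function name `lists_by_paragraph` is hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Shortened sample modelled on the question's markup (an assumption for illustration)
html = """<div class="InternaTesto">
<p class="MarginTop0">Paragraph 1</p><br>
<p><strong>Paragraph 3</strong></p><br>
<ul>
    <li><em>List item 1</em></li>
    <li><em>List item 2</em></li>
</ul>
<p><strong>Paragraph 4</strong></p><br>
<ul>
    <li><em>List item 3</em></li>
</ul>
</div>"""


def lists_by_paragraph(markup):
    """Return (paragraph text, [list item texts]) pairs in document order."""
    soup = BeautifulSoup(markup, "html.parser")
    container = soup.find("div", class_="InternaTesto")
    groups = []
    for p in container.find_all("p"):
        items = []
        for sibling in p.next_siblings:
            if sibling.name == "p":
                break  # the next paragraph starts a new group
            if sibling.name == "ul":
                items.extend(li.get_text(strip=True) for li in sibling.find_all("li"))
        groups.append((p.get_text(strip=True), items))
    return groups


groups = lists_by_paragraph(html)
for title, items in groups:
    print(title, items)
```

A list of pairs is used instead of a dictionary so that two paragraphs with identical text do not overwrite each other.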
    
Hope that helps.
