Python: How do I select tags by their children and siblings with BeautifulSoup?

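I'm trying to extract some quotes from the 2012 Obama-Romney presidential debate. The problem is that the page is not well organized. The structure looks like this: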

<span class="displaytext">
    <p>
        <i>OBAMA</i>Obama's first quotes
    </p>
    <p>More quotes from Obama</p>
    <p>Some more Obama quotes</p>

    <p>
        <i>Moderator</i>Moderator's quotes
    </p>
    <p>Some more quotes</p>

    <p>
        <i>ROMNEY</i>Romney's quotes
    </p>
    <p>More quotes from Romney</p>
    <p>Some more Romney quotes</p>
</span>

for i in president_quotes:
    print(i.next_sibling)
    siblings = i.parent.find_next_siblings('p')
    for sibling in siblings:
        if sibling.find("i"):
            break
        print(sibling.string)
It only printed Obama's first quote.

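I think a flag-based approach, similar to a small state machine, would work here: walk the paragraphs in order and track whether Obama is the current speaker. Something like this: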

from bs4 import BeautifulSoup

# `html` is assumed to hold the debate markup shown in the question
soup = BeautifulSoup(html, 'lxml')
debate_text = soup.find("span", {"class": "displaytext"})
obama_is_on = False
obama_tags = []
for p in debate_text("p"):
    # assuming <i> is used only to indicate the speaker
    if p.i:
        # a new speaker label switches the flag on or off
        obama_is_on = 'OBAMA' in p.i.get_text()
    if obama_is_on:
        obama_tags.append(p)
print(obama_tags)

This prints:

[<p>
<i>OBAMA</i>Obama's first quotes
        </p>, <p>More quotes from Obama</p>, <p>Some more Obama quotes</p>]
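
The same flag idea can be extended to collect every speaker's quotes in one pass. The following is only a sketch under some assumptions: the function name `quotes_by_speaker` and the `HTML` sample are made up for illustration, `<i>` is assumed to appear only as a speaker label, and `html.parser` is used instead of `lxml` purely to avoid the extra dependency.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Sample markup copied from the question (hypothetical variable name).
HTML = """
<span class="displaytext">
    <p>
        <i>OBAMA</i>Obama's first quotes
    </p>
    <p>More quotes from Obama</p>
    <p>Some more Obama quotes</p>

    <p>
        <i>Moderator</i>Moderator's quotes
    </p>
    <p>Some more quotes</p>

    <p>
        <i>ROMNEY</i>Romney's quotes
    </p>
    <p>More quotes from Romney</p>
    <p>Some more Romney quotes</p>
</span>
"""


def quotes_by_speaker(html):
    """Group paragraph text under the most recently seen <i> speaker label."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("span", class_="displaytext")
    grouped = defaultdict(list)
    speaker = None
    for p in container.find_all("p"):
        text = p.get_text(strip=True)
        if p.i is not None:
            # A new label starts a new speaker; drop the label from the text.
            speaker = p.i.get_text(strip=True)
            text = text[len(speaker):].strip()
        if speaker is not None and text:
            grouped[speaker].append(text)
    return dict(grouped)


if __name__ == "__main__":
    for name, quotes in quotes_by_speaker(HTML).items():
        print(name, quotes)
```

Because the result is keyed by speaker label, `quotes_by_speaker(HTML)["OBAMA"]` gives plain strings rather than `<p>` tags, which may be more convenient if only the text is needed.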

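Obama's other quotes are siblings of the p, not of the i, so you need to find the siblings of the i's parent. As you loop over those siblings, you can stop as soon as one of them contains an i, because that marks the next speaker. Something like this: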

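# president_quotes is assumed to hold the <i>OBAMA</i> tags, for example:
# president_quotes = soup.find_all("i", string="OBAMA")
for i in president_quotes:
    print(i.next_sibling)  # text right after <i> inside the same <p>
    # the remaining quotes are siblings of the enclosing <p>
    siblings = i.parent.find_next_siblings('p')
    for sibling in siblings:
        if sibling.find("i"):  # the next speaker starts here
            break
        print(sibling.string)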
Which prints:

Obama's first quotes
More quotes from Obama
Some more Obama quotes
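
For completeness, here is a self-contained sketch of the sibling approach that can be run as-is. It rests on a few assumptions: the helper name `quotes_for` and the `HTML` sample are hypothetical, the speaker label is the only content of its `<i>` tag, and `html.parser` is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (hypothetical variable name).
HTML = """
<span class="displaytext">
    <p>
        <i>OBAMA</i>Obama's first quotes
    </p>
    <p>More quotes from Obama</p>
    <p>Some more Obama quotes</p>

    <p>
        <i>Moderator</i>Moderator's quotes
    </p>
    <p>Some more quotes</p>

    <p>
        <i>ROMNEY</i>Romney's quotes
    </p>
    <p>More quotes from Romney</p>
    <p>Some more Romney quotes</p>
</span>
"""


def quotes_for(html, speaker):
    """Collect a speaker's quotes by walking the siblings of the label's <p>."""
    soup = BeautifulSoup(html, "html.parser")
    quotes = []
    for label in soup.find_all("i", string=speaker):
        # The first quote is the text node right after <i> in the same <p>.
        first = label.next_sibling
        if isinstance(first, str) and first.strip():
            quotes.append(first.strip())
        # Later quotes live in the following <p> siblings of the parent.
        for sibling in label.parent.find_next_siblings("p"):
            if sibling.find("i"):
                break  # another speaker label: stop collecting
            quotes.append(sibling.get_text(strip=True))
    return quotes


if __name__ == "__main__":
    for quote in quotes_for(HTML, "OBAMA"):
        print(quote)
```

Returning a list instead of printing makes the result easier to reuse, and calling `quotes_for(HTML, "ROMNEY")` applies the same walk to another speaker.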