How do I use BeautifulSoup to select tags based on their children and siblings?


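I'm trying to extract some quotes from the 2012 Obama-Romney presidential debates. The problem is that the page isn't well organized. Its structure looks like this: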

<span class="displaytext">
    <p>
        <i>OBAMA</i>Obama's first quotes
    </p>
    <p>More quotes from Obama</p>
    <p>Some more Obama quotes</p>

    <p>
        <i>Moderator</i>Moderator's quotes
    </p>
    <p>Some more quotes</p>

    <p>
        <i>ROMNEY</i>Romney's quotes
    </p>
    <p>More quotes from Romney</p>
    <p>Some more Romney quotes</p>
</span>

for i in president_quotes:
    print(i.next_sibling)

It only prints Obama's first quote.

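I think an approach like the following would work here: walk the paragraphs in order and keep a flag that records whether Obama is currently speaking. Something like this: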

from bs4 import BeautifulSoup

# html is assumed to hold the page source shown in the question
soup = BeautifulSoup(html, 'lxml')
debate_text = soup.find("span", {"class": "displaytext"})
obama_is_on = False  # True while the current speaker is Obama
obama_tags = []
for p in debate_text.find_all("p"):
    if p.i:
        # assuming <i> is used only to indicate the speaker
        obama_is_on = p.i.get_text(strip=True) == 'OBAMA'
    if obama_is_on:
        obama_tags.append(p)
print(obama_tags)

[<p>
<i>OBAMA</i>Obama's first quotes
        </p>, <p>More quotes from Obama</p>, <p>Some more Obama quotes</p>]
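The same flag idea can be generalized to remember the current speaker instead of a single boolean, which collects every speaker's quotes in one pass. This is only a sketch under assumptions: the function name `quotes_by_speaker` and the `HTML` sample are illustrative, `html.parser` is used to avoid the lxml dependency, and, as above, `<i>` is assumed to appear only as a speaker label.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Sample markup copied from the question (assumed representative of the real page)
HTML = """
<span class="displaytext">
    <p>
        <i>OBAMA</i>Obama's first quotes
    </p>
    <p>More quotes from Obama</p>
    <p>Some more Obama quotes</p>

    <p>
        <i>Moderator</i>Moderator's quotes
    </p>
    <p>Some more quotes</p>

    <p>
        <i>ROMNEY</i>Romney's quotes
    </p>
    <p>More quotes from Romney</p>
    <p>Some more Romney quotes</p>
</span>
"""


def quotes_by_speaker(html):
    """Hypothetical helper: map each speaker label to the text of their paragraphs."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("span", class_="displaytext")
    quotes = defaultdict(list)
    speaker = None  # nobody is speaking until the first <i> label is seen
    for p in container.find_all("p"):
        if p.i:
            # an <i> label switches the current speaker
            speaker = p.i.get_text(strip=True)
            # drop the label so only the quote text remains in the paragraph
            p.i.extract()
        if speaker is not None:
            quotes[speaker].append(p.get_text(strip=True))
    return dict(quotes)


if __name__ == "__main__":
    for name, lines in quotes_by_speaker(HTML).items():
        print(name, lines)
```

Keeping the speaker name rather than a flag means Romney's and the moderator's paragraphs are grouped as well, without a second pass over the document.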

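Obama's other quotes are siblings of the <p>, not of the <i>, so you need to look at the siblings of the <i>'s parent. As you loop over those siblings, you can stop as soon as one of them contains an <i>. Something like this: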

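# president_quotes is assumed to hold the <i> tags that label the speaker of interest
for i in president_quotes:
    # text that follows the <i> inside the same <p>
    print(i.next_sibling)
    # the remaining quotes live in the <p> tags that follow i's parent
    siblings = i.parent.find_next_siblings('p')
    for sibling in siblings:
        # a <p> containing an <i> starts the next speaker, so stop there
        if sibling.find("i"):
            break
        print(sibling.string)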

This prints:

Obama's first quotes
More quotes from Obama
Some more Obama quotes
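For a version of the sibling approach that runs on its own, here is a minimal sketch. The helper name `quotes_after_label`, the `HTML` sample, and the choice of `html.parser` are assumptions made for illustration; the speaker's `<i>` tags are looked up with `find_all("i", string=...)`, which may differ from how `president_quotes` was originally built.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (assumed representative of the real page)
HTML = """
<span class="displaytext">
    <p>
        <i>OBAMA</i>Obama's first quotes
    </p>
    <p>More quotes from Obama</p>
    <p>Some more Obama quotes</p>

    <p>
        <i>Moderator</i>Moderator's quotes
    </p>
    <p>Some more quotes</p>

    <p>
        <i>ROMNEY</i>Romney's quotes
    </p>
    <p>More quotes from Romney</p>
    <p>Some more Romney quotes</p>
</span>
"""


def quotes_after_label(soup, speaker):
    """Hypothetical helper: collect the quotes that follow each <i>speaker</i> label."""
    quotes = []
    for label in soup.find_all("i", string=speaker):
        # the first quote is the text node right after the <i>, inside the same <p>
        first = label.next_sibling
        if isinstance(first, str) and first.strip():
            quotes.append(first.strip())
        # later quotes are <p> siblings of the label's parent
        for sibling in label.parent.find_next_siblings("p"):
            if sibling.find("i"):
                break  # the next speaker's label ends this speaker's turn
            quotes.append(sibling.get_text(strip=True))
    return quotes


if __name__ == "__main__":
    soup = BeautifulSoup(HTML, "html.parser")
    for line in quotes_after_label(soup, "OBAMA"):
        print(line)
```

Returning a list instead of printing makes the result easy to reuse, and stripping the text node removes the trailing whitespace that `next_sibling` picks up from the indented markup.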