Python: How do I use BeautifulSoup to select tags based on their children and siblings?
I am trying to extract some quotes from the 2012 Obama-Romney presidential debate. The problem is that the page is poorly structured. The structure looks like this:
<span class="displaytext">
  <p>
    <i>OBAMA</i>Obama's first quotes
  </p>
  <p>More quotes from Obama</p>
  <p>Some more Obama quotes</p>
  <p>
    <i>Moderator</i>Moderator's quotes
  </p>
  <p>Some more quotes</p>
  <p>
    <i>ROMNEY</i>Romney's quotes
  </p>
  <p>More quotes from Romney</p>
  <p>Some more Romney quotes</p>
</span>
However, I only managed to print Obama's first quote.

I think a similar solution would work here. Something like this:
from bs4 import BeautifulSoup

# `input` is expected to hold the HTML source of the debate page
soup = BeautifulSoup(input, 'lxml')
debate_text = soup.find("span", {"class": "displaytext"})
obama_is_on = False
obama_tags = []
for p in debate_text("p"):
    if p.i and 'OBAMA' in p.i.get_text():
        # assuming <i> is used only to indicate speaker
        obama_is_on = True
    if p.i and 'OBAMA' not in p.i.get_text():
        # a different speaker starts here, so stop collecting
        obama_is_on = False
        continue
    if obama_is_on:
        obama_tags.append(p)
print(obama_tags)
[<p>
<i>OBAMA</i>Obama's first quotes
</p>, <p>More quotes from Obama</p>, <p>Some more Obama quotes</p>]
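The same flag idea extends to every speaker at once: remember the most recent `<i>` label and file each paragraph under it. The following is only a sketch under assumptions: the function name `quotes_by_speaker` and the inline `HTML` sample are made up for illustration, `html.parser` is used instead of `lxml` to avoid an extra dependency, and `<i>` is assumed to appear only as a speaker label.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Sample markup modelled on the question; a real page would be downloaded instead.
HTML = """
<span class="displaytext">
<p><i>OBAMA</i>Obama's first quotes</p>
<p>More quotes from Obama</p>
<p>Some more Obama quotes</p>
<p><i>Moderator</i>Moderator's quotes</p>
<p>Some more quotes</p>
<p><i>ROMNEY</i>Romney's quotes</p>
<p>More quotes from Romney</p>
<p>Some more Romney quotes</p>
</span>
"""


def quotes_by_speaker(html):
    """Group paragraph text under the most recent <i> speaker label."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("span", class_="displaytext")
    grouped = defaultdict(list)
    speaker = None
    for p in container.find_all("p"):
        text = p.get_text(strip=True)
        if p.i is not None:
            # Assumption: <i> is used only to mark who is speaking.
            speaker = p.i.get_text(strip=True)
            # Remove the speaker label from the front of the paragraph text.
            text = text[len(speaker):].strip()
        if speaker is not None and text:
            grouped[speaker].append(text)
    return dict(grouped)


if __name__ == "__main__":
    for name, quotes in quotes_by_speaker(HTML).items():
        print(name, quotes)
```

Returning a plain dictionary keyed by speaker makes it easy to pull out one candidate afterwards, for example `quotes_by_speaker(HTML)["OBAMA"]`.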
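Obama's other quotes are siblings of the <p>, not of the <i>, so you need to look at the siblings of the <i>'s parent. As you loop over those siblings, you can stop as soon as one of them contains an <i>. Something like this: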
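# president_quotes is assumed to hold the <i>OBAMA</i> tags found earlier
for i in president_quotes:
    # the text right after <i> inside the same <p> is the first quote
    print(i.next_sibling)
    siblings = i.parent.find_next_siblings('p')
    for sibling in siblings:
        # a <p> containing an <i> marks the next speaker, so stop there
        if sibling.find("i"):
            break
        print(sibling.string)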
which prints:
Obama's first quotes
More quotes from Obama
Some more Obama quotes
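For anyone who wants to try the sibling walk end to end, here is a minimal self-contained sketch. The helper name `quotes_for` and the inline `HTML` sample are assumptions for illustration, not part of the original answer. It also assumes a Beautiful Soup version that accepts the `string=` argument of `find_all` (older releases call it `text=`).

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; a real page would be downloaded instead.
HTML = """
<span class="displaytext">
<p><i>OBAMA</i>Obama's first quotes</p>
<p>More quotes from Obama</p>
<p>Some more Obama quotes</p>
<p><i>Moderator</i>Moderator's quotes</p>
<p>Some more quotes</p>
<p><i>ROMNEY</i>Romney's quotes</p>
<p>More quotes from Romney</p>
<p>Some more Romney quotes</p>
</span>
"""


def quotes_for(soup, name):
    """Collect the quotes that follow each <i>name</i> label (hypothetical helper)."""
    quotes = []
    for label in soup.find_all("i", string=name):
        # The first quote is the text node right after <i> inside the same <p>.
        first = label.next_sibling
        if isinstance(first, str) and first.strip():
            quotes.append(first.strip())
        # Later quotes live in the following <p> siblings of the label's parent.
        for sibling in label.parent.find_next_siblings("p"):
            if sibling.find("i"):
                # Another speaker label: this speaker's turn is over.
                break
            quotes.append(sibling.get_text(strip=True))
    return quotes


if __name__ == "__main__":
    soup = BeautifulSoup(HTML, "html.parser")
    for quote in quotes_for(soup, "OBAMA"):
        print(quote)
```

Collecting into a list rather than printing directly keeps the traversal reusable, so the same call works for `"ROMNEY"` or `"Moderator"`.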