Python: using .find() to extract the second of 2 identical 'div's from an HTML page with BS4


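I am trying to extract the second of two identical 'div' elements from a soup object. When I parse and extract with the .find() method, it only returns the first one from the top. How can I tell the script to skip the first one and take the next one when a certain condition is met? Below is the HTML I want to extract from: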

<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>
However, the result is still:

NOT IN
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>

instead of this:

<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>

Any suggestions?

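You need find_all and then index into the returned list, because find only returns the first match. You can do the same with select. With bs4 4.7.1 you can use :contains to target an element's innerText by a substring (e.g. CONtv trial), then use select_one (if you want the first match) or select if there are multiple matches. Test that a match was actually returned before attempting to access .text.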

from bs4 import BeautifulSoup as bs

html = '''
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>
'''
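soup = bs(html, 'lxml')
# find_all returns every match; index 1 is the second div with this class string
print(soup.find_all('div', {'class': 'a-row a-size-base a-color-secondary'})[1].text)
# select does the same job with a CSS selector
print(soup.select('.a-color-secondary')[1].text)
# :contains matches on inner text; newer Soup Sieve releases spell it :-soup-contains
print(soup.select_one('.a-color-secondary:contains("CONtv trial")').text)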
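The advice to test for a match before reading .text can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name nth_text, the trimmed-down HTML, the ".no-such-class" selector and the 'html.parser' backend are all choices made for this example, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the markup from the question (assumption for this sketch)
html = """
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
"""

soup = BeautifulSoup(html, "html.parser")


def nth_text(soup, selector, index, default=None):
    """Hypothetical helper: stripped text of the index-th match (0-based), or default."""
    matches = soup.select(selector)
    if index < len(matches):
        return matches[index].get_text(strip=True)
    return default


second = nth_text(soup, ".a-color-secondary", 1)
third = nth_text(soup, ".a-color-secondary", 2, default="not found")

# select_one returns None when nothing matches, so test before using .text
missing = soup.select_one(".no-such-class")
missing_text = missing.text if missing is not None else "not found"

print(second)
print(third)
print(missing_text)
```

Indexing a list returned by select or find_all raises IndexError when there are too few matches, and calling .text on None raises AttributeError, so both paths are guarded here.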

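Assuming the divs are now directly under <body>, you can use standard Python indexing. In real code, replace body in the selector with the appropriate element.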

data = '''<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

print(soup.select('body > div')[1].text.strip())
Note the > sign in select(): it means we want all <div> elements that are directly under <body>.
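To make the difference between the > (child) combinator and a plain space (descendant) concrete, here is a small sketch. The markup and the id values are invented for this example, and 'html.parser' is used only so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

# Invented markup: two divs directly under <body>, one of them with a nested div
html = """
<html><body>
<div id="first">MPAA Rating</div>
<div id="outer"><div id="inner">$0.00 trial</div></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# "body > div": only divs whose parent is <body>
direct = [d["id"] for d in soup.select("body > div")]

# "body div": every div anywhere below <body>, at any depth
nested = [d["id"] for d in soup.select("body div")]

print(direct)
print(nested)
```

With the child combinator the nested div is skipped, which is why indexing the result with [1] lands on the second top-level div rather than on an inner one.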

# Collect every div with this class string, then keep only those whose markup contains '$'
matches = soup.find_all('div', {'class': 'a-row a-size-base a-color-secondary'})
for item in matches:
    if '$' in str(item):
        print(item.text)
$0.00 with a CONtv trial on Prime Video Channels
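The condition-based approach (find_all followed by a check such as looking for '$') could also be wrapped in a small function that returns the first div satisfying an arbitrary test. This is a sketch only: first_matching, its parameters and the shortened HTML are hypothetical, and it checks the visible text rather than the raw markup.

```python
from bs4 import BeautifulSoup

# Shortened version of the markup from the question (assumption for this sketch)
data = """
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
"""

soup = BeautifulSoup(data, "html.parser")


def first_matching(soup, css_class, predicate):
    """Hypothetical helper: first div carrying css_class whose text passes predicate, else None."""
    candidates = soup.find_all("div", class_=css_class)
    return next((d for d in candidates if predicate(d.get_text())), None)


# Skip any div whose text has no dollar sign and take the first one that does
price = first_matching(soup, "a-color-secondary", lambda text: "$" in text)
print(price.get_text(strip=True) if price is not None else "no match")
```

Returning None when nothing passes the test keeps the caller in control of the fallback, in the same way find and select_one behave.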