Python: using .find() to extract the second of 2 identical 'div's from an HTML page with BS4
I am trying to extract the second of 2 identical 'div's from a soup element. When I use the .find() method to parse and extract, it only grabs the first one from the top. How can I tell the script to skip the first one and grab the next one if certain conditions are met? Below is the HTML I want to extract from:
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>
However, the result is still:
NOT IN
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
instead of this:
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
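Any suggestions?
You need find_all and then index into the returned list, because find only returns the first match. You can do the same with select. With bs4 4.7.1 you can use :contains to target an element's inner text by a substring (for example, CONtv trial), then use select_one if you want the first match, or select if there are multiple matches. Check that the result is not None before trying to access .text.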
from bs4 import BeautifulSoup as bs
import requests
html = '''
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>
'''
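soup = bs(html, 'lxml')
# find_all returns every match; index 1 is the second div
print(soup.find_all('div', {'class': 'a-row a-size-base a-color-secondary'})[1].text)
# the same lookup with a CSS selector
print(soup.select('.a-color-secondary')[1].text)
# match on a substring of the text (bs4 4.7.1+); Soup Sieve 2.1+ deprecates
# :contains in favor of :-soup-contains
print(soup.select_one('.a-color-secondary:contains("CONtv trial")').text)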
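Indexing with [1] raises IndexError when fewer than two elements match, and select_one returns None when nothing matches, so it may help to guard the lookup. The following is only a minimal sketch: the helper name second_match_text and its parameters are made up for illustration, the HTML is a trimmed copy of the snippet from the question, and html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

html = '''
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
'''


def second_match_text(markup, css='.a-color-secondary', index=1):
    """Hypothetical helper: text of the match at `index`, or None if there are too few matches."""
    soup = BeautifulSoup(markup, 'html.parser')
    matches = soup.select(css)
    # Guard the index instead of letting matches[index] raise IndexError.
    if len(matches) <= index:
        return None
    return matches[index].get_text(strip=True)


print(second_match_text(html))
print(second_match_text('<p>no divs here</p>'))
```

The first call prints the price line and the second prints None, so the caller can decide what to do when the page layout differs from what was expected.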
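Supposing the divs are directly under <body>, you can use standard Python indexing. In your actual code, replace body in the selector with the appropriate element: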
data = '''<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>'''
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(data, 'lxml')
print(soup.select('body > div')[1].text.strip())
Note the > sign in select(). It means we want all <div> elements that are directly under <body>.
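To make the effect of the > combinator concrete, here is a small sketch that compares it with the plain descendant selector. The document and its id values are invented for this illustration, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Invented markup: one nested div and two top-level divs.
doc = '''<html><body>
<div id="outer-1"><div id="inner-1">nested</div></div>
<div id="outer-2">top level</div>
</body></html>'''

soup = BeautifulSoup(doc, 'html.parser')

# "body > div" keeps only divs whose parent is <body>.
direct = [d['id'] for d in soup.select('body > div')]
# "body div" keeps every div anywhere below <body>.
anywhere = [d['id'] for d in soup.select('body div')]

print(direct)    # ['outer-1', 'outer-2']
print(anywhere)  # ['outer-1', 'inner-1', 'outer-2']
```

With the child combinator, index [1] always refers to the second top-level div, regardless of how many divs are nested inside the first one.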
matches = soup.find_all('div', {'class': 'a-row a-size-base a-color-secondary'})
for item in matches:
    if '$' in str(item):
        print(item.text)
$0.00 with a CONtv trial on Prime Video Channels
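Since the question asks how to skip the first div when a condition is met, another option is to pass a function to find(), which Beautiful Soup calls for each tag and which returns the first tag for which the function is true. This is a sketch rather than the approach of any answer above: the predicate name is_price_row and the '$' test are assumptions chosen for this example, and the HTML is a trimmed copy of the snippet from the question.

```python
from bs4 import BeautifulSoup

html = '''
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
'''


def is_price_row(tag):
    """Hypothetical predicate: a div with the a-color-secondary class whose text contains '$'."""
    return (
        tag.name == 'div'
        and 'a-color-secondary' in tag.get('class', [])
        and '$' in tag.get_text()
    )


soup = BeautifulSoup(html, 'html.parser')
# find() returns the first tag for which the predicate is true, or None.
row = soup.find(is_price_row)
print(row.get_text(strip=True) if row is not None else 'no match')
```

Here the MPAA div fails the '$' test, so find() moves on and returns the price div without any explicit indexing.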