Python BeautifulSoup: DIVs with the same name, ignore the first
So, we are using Python bs4 and trying to work out how to ignore DIVs with the same name in order to collect the second batch of data. Below is a sample of the markup I am trying to extract the required data from.
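##Pointless Data###
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
</div>
##Wanted data##
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>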
from bs4 import BeautifulSoup
soup = BeautifulSoup("", "html.parser")
div = soup.find("div")
div.find_all("div", {"class": "PowerDetails"})
PowerDetails[1].find_all("p", {"class": "RunningCost"})
PowerDetails[1].find_all("p", {"class": "Time"})
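find_all() returns a list. Use slicing or indexing to access only the elements you need: slicing the result from index 1 onward skips the first match. First of all, though, the code in your question never finds the right tags.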
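from bs4 import BeautifulSoup
html_doc = """
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
</div>
##Wanted data##
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
# A single line of code is enough to get the divs.
# Note the trailing underscore: "class" is a reserved word in Python.
powerDetails = soup.find_all(class_="PowerDetails")
print(len(powerDetails))  # Output: 2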
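Now you can slice the list to ignore the first div
powerDetails = powerDetails[1:]  # take elements from the second one onward (skip the first)
print(len(powerDetails))  # Output: 1
Now you have a list containing only one element
print(powerDetails)
Output: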
[<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>]
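The slice above still leaves you with whole div tags rather than the prices themselves. As a rough sketch of one way to go the last step, assuming the markup always alternates a RunningCost paragraph with a Time paragraph inside each Company div (the helper name extract_rates is made up for this illustration and is not part of bs4):

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
</div>
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>
"""


def extract_rates(html):
    """Hypothetical helper: one list of (time, cost) pairs per Company div."""
    soup = BeautifulSoup(html, "html.parser")
    # Skip the first PowerDetails div, keep every later one.
    blocks = soup.find_all("div", class_="PowerDetails")[1:]
    rates = []
    for block in blocks:
        for company in block.find_all("div", class_="Company"):
            costs = [p.get_text(strip=True)
                     for p in company.find_all("p", class_="RunningCost")]
            times = [p.get_text(strip=True)
                     for p in company.find_all("p", class_="Time")]
            # Assumption: the n-th cost belongs to the n-th time label.
            rates.append(list(zip(times, costs)))
    return rates


rates = extract_rates(html_doc)
for company_rates in rates:
    print(company_rates)
```

get_text(strip=True) also removes the stray space in front of values such as " $9.99". If the costs and times were not strictly paired in the real page, the zip would silently misalign them, so that assumption is worth checking first.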
Another approach, using the simplified_scrapy library:
from simplified_scrapy import SimplifiedDoc
html = '''
##Pointless Data###
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
</div>
##Wanted data##
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>
'''
doc = SimplifiedDoc(html)
# First method, get all, use index.
PowerDetails = doc.selects('div.PowerDetails')[1].selects(
    'div.Company').selects('p')
for ps in PowerDetails:
    print([(p['class'], p.text) for p in ps])
# Second method, skip the first with parameter start
PowerDetails = doc.getElement(
    'div', value='PowerDetails',
    start='class="PowerDetails"').selects('div.Company').selects('p')
for ps in PowerDetails:
    print([(p['class'], p.text) for p in ps])
Output:
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$9.99'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$7.77'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$9.99'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$7.77'), ('Time', 'Off-peek')]
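For comparison, the same (class, text) rows can probably be produced with bs4 alone through its CSS selector support. This is only a sketch under the assumption that each p carries exactly one class; bs4 returns the class attribute as a list, so the first entry is taken. The variable names second_block and rows are made up for the example.

```python
from bs4 import BeautifulSoup

html = '''
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
</div>
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# select() returns a list, so index 1 is the second PowerDetails div.
second_block = soup.select("div.PowerDetails")[1]

# One row per Company div; each row holds (class name, stripped text) tuples.
rows = [
    [(p["class"][0], p.get_text(strip=True)) for p in company.select("p")]
    for company in second_block.select("div.Company")
]

for row in rows:
    print(row)
```

Indexing with [1] raises an IndexError if the page has fewer than two PowerDetails divs, so a length check may be worth adding when scraping real pages.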