Python BeautifulSoup: ignoring the first of several DIVs with the same name


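We are using Python with bs4 and trying to work out how to skip a DIV that shares its name with another one, so that only the second batch of data is collected.

Below is a sample of the markup I am trying to extract the wanted data from, followed by my code.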

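##Pointless Data###
<div class="PowerDetails">
<div class="Company">
        <p class="RunningCost">$4.44</p>
        <p class="Time">peek</p>
        <p class="RunningCost"> $2.33</p>
        <p class="Time">Off-peek</p>
</div>
</div>

##Wanted data##
<div class="PowerDetails">
  <div class="Company">
        <p class="RunningCost">$8.88</p>
        <p class="Time">peek</p>
        <p class="RunningCost"> $9.99</p>
        <p class="Time">Off-peek</p>
  </div>
  <div class="Company">
        <p class="RunningCost">$8.88</p>
        <p class="Time">peek</p>
        <p class="RunningCost"> $7.77</p>
        <p class="Time">Off-peek</p>
  </div>
</div>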

from bs4 import BeautifulSoup
soup = BeautifulSoup("", "html.parser")
div = soup.find("div")
div.find_all("div", {"class": "PowerDetails"})
PowerDetails[1].find_all("p", {"class": "RunningCost"})
PowerDetails[1].find_all("p", {"class": "Time"})

find_all() returns a list. Use slicing or indexing to access only the elements you need.
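As a minimal sketch of the difference between the two, assuming a cut-down version of the markup from the question (the html string and the variable names blocks, second and rest are illustrative, not from the original post): indexing gives back a single tag, while slicing gives back a list.

```python
from bs4 import BeautifulSoup

# Cut-down markup for illustration: two sibling divs sharing one class name.
html = """
<div class="PowerDetails"><p class="RunningCost">$4.44</p></div>
<div class="PowerDetails"><p class="RunningCost">$8.88</p></div>
"""

soup = BeautifulSoup(html, "html.parser")
blocks = soup.find_all("div", class_="PowerDetails")

second = blocks[1]   # indexing: a single Tag (the second match)
rest = blocks[1:]    # slicing: a list of every match after the first

print(second.p.get_text())             # $8.88
print([b.p.get_text() for b in rest])  # ['$8.88']
```

Indexing is convenient when exactly one block is wanted; slicing keeps working unchanged if more blocks with the same class appear later in the page.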

You can slice the resulting list to take the elements from index 1 onward. First, though, note that your code is not finding the right tags.

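from bs4 import BeautifulSoup
html_doc = """
##Pointless Data###
<div class="PowerDetails">
<div class="Company">
        <p class="RunningCost">$4.44</p>
        <p class="Time">peek</p>
        <p class="RunningCost"> $2.33</p>
        <p class="Time">Off-peek</p>
</div>
</div>

##Wanted data##
<div class="PowerDetails">
  <div class="Company">
        <p class="RunningCost">$8.88</p>
        <p class="Time">peek</p>
        <p class="RunningCost"> $9.99</p>
        <p class="Time">Off-peek</p>
  </div>
  <div class="Company">
        <p class="RunningCost">$8.88</p>
        <p class="Time">peek</p>
        <p class="RunningCost"> $7.77</p>
        <p class="Time">Off-peek</p>
  </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
# A single line is enough to get the divs (note class_, since class is a Python keyword)
powerDetails = soup.find_all(class_="PowerDetails")
print(len(powerDetails))  # prints 2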
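Now you can slice the list to ignore the first div:

powerDetails = powerDetails[1:]  # take the elements from the second one onward (skip the first)
print(len(powerDetails))  # prints 1

You now have a list containing only one element:

print(powerDetails)

Output: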

[<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>]
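To go from the remaining div to the actual values, one possible sketch is shown below. It assumes the same markup as above and that each Time paragraph describes the RunningCost paragraph just before it; the rows variable and the pairing logic are illustrative choices, not part of the original answer.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="PowerDetails">
  <div class="Company">
    <p class="RunningCost">$4.44</p>
    <p class="Time">peek</p>
    <p class="RunningCost"> $2.33</p>
    <p class="Time">Off-peek</p>
  </div>
</div>
<div class="PowerDetails">
  <div class="Company">
    <p class="RunningCost">$8.88</p>
    <p class="Time">peek</p>
    <p class="RunningCost"> $9.99</p>
    <p class="Time">Off-peek</p>
  </div>
  <div class="Company">
    <p class="RunningCost">$8.88</p>
    <p class="Time">peek</p>
    <p class="RunningCost"> $7.77</p>
    <p class="Time">Off-peek</p>
  </div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

rows = []
# Skip the first PowerDetails div, then walk each Company inside the rest.
for block in soup.find_all("div", class_="PowerDetails")[1:]:
    for company in block.find_all("div", class_="Company"):
        costs = [p.get_text(strip=True)
                 for p in company.find_all("p", class_="RunningCost")]
        times = [p.get_text(strip=True)
                 for p in company.find_all("p", class_="Time")]
        # Assumption: the n-th Time label belongs to the n-th RunningCost.
        rows.append(list(zip(times, costs)))

for row in rows:
    print(row)
# [('peek', '$8.88'), ('Off-peek', '$9.99')]
# [('peek', '$8.88'), ('Off-peek', '$7.77')]
```

Passing strip=True to get_text() removes the leading space that some of the RunningCost paragraphs carry in the markup.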
Another approach, using simplified_scrapy:

from simplified_scrapy import SimplifiedDoc

html = '''
##Pointless Data###
<div class="PowerDetails">
<div class="Company">
        <p class="RunningCost">$4.44</p>
        <p class="Time">peek</p>
        <p class="RunningCost"> $2.33</p>           
        <p class="Time">Off-peek</p>
</div>
</div>

##Wanted data##
<div class="PowerDetails">
  <div class="Company">
        <p class="RunningCost">$8.88</p>
        <p class="Time">peek</p>
        <p class="RunningCost"> $9.99</p>           
        <p class="Time">Off-peek</p>
  </div>
  <div class="Company">
        <p class="RunningCost">$8.88</p>
        <p class="Time">peek</p>
        <p class="RunningCost"> $7.77</p>           
        <p class="Time">Off-peek</p>
  </div>
</div>
'''

doc = SimplifiedDoc(html)

# First method, get all, use index.
PowerDetails = doc.selects('div.PowerDetails')[1].selects(
    'div.Company').selects('p')
for ps in PowerDetails:
    print([(p['class'], p.text) for p in ps])

# Second method, skip the first with parameter start
PowerDetails = doc.getElement(
    'div', value='PowerDetails',
    start='class="PowerDetails"').selects('div.Company').selects('p')
for ps in PowerDetails:
    print([(p['class'], p.text) for p in ps])
Output:

[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$9.99'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$7.77'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$9.99'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$7.77'), ('Time', 'Off-peek')]