Python BeautifulSoup: parsing HTML into a dictionary where <h> is the key and <p> is the value

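I am trying to scrape a website and parse some of its data into a usable format for analysis. The data I want to extract sits inside one block, so I can easily reach that part of the HTML.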

I would like to end up with a Python dictionary of lists, like this:

{"Processor": ["3.7GHz Quad-Core Intel Xeon E5 processor with 10MB of L3 cache"],
 "Memory": ["16GB (four 4GB) of 1866MHz DDR3 EEC"],
 "Storage": ["512GB PCIe-based flash storage"],
 "Input/Output": ["Four USB 3 ports (up to 5 Gbps)", "Six Thunderbolt 2 ports (up to 20 Gbps)"]}
However, the HTML I am trying to parse looks more like this:

<div class="as-productinfosection-panel TechSpecs-panel row">
   <div class="as-productinfosection-sidepanel column large-3 small-12">
      <h3 data-autom="sectionTitle">Tech Specs</h3>
   </div>
   <div class="as-productinfosection-mainpanel column large-9 small-12">
      <h4 class="h4-para-title">
         Processor
      </h4>
      <div class="para-list as-pdp-lastparalist">
         <p>
            3.7GHz Quad-Core Intel Xeon E5 processor with 10MB of L3 cache
         </p>
      </div>
      <h4 class="h4-para-title">
         Memory
      </h4>
      <div class="para-list as-pdp-lastparalist">
         <p>
            16GB (four 4GB) of 1866MHz DDR3 EEC
         </p>
      </div>
      <h4 class="h4-para-title">
         Storage
      </h4>
      <div class="para-list as-pdp-lastparalist">
         <p>
            512GB PCIe-based flash storage<sup>1</sup>
         </p>
      </div>
      <h4 class="h4-para-title">
         Graphics
      </h4>
      <div class="para-list as-pdp-lastparalist">
         <p>
            Dual AMD FirePro D300 graphics processors with 2GB of GDDR5 VRAM each
         </p>
      </div>
      <h4 class="h4-para-title">
         Input/Output
      </h4>
      <div class="para-list">
         <p>
            Four USB 3 ports (up to 5 Gbps)
         </p>
      </div>
      <div class="para-list">
         <p>
            Six Thunderbolt 2 ports (up to 20 Gbps)
         </p>
      </div>
      <div class="para-list">
         <p>
            Dual Gigabit Ethernet ports
         </p>
      </div>
      <div class="para-list as-pdp-lastparalist">
         <p>
            One HDMI 1.4 Ultra HD port
         </p>
      </div>
      <h4 class="h4-para-title">
         Audio
      </h4>
      <div class="para-list">
         <p>
            Combined optical digital audio output/analog line out minijack
         </p>
      </div>
      <div class="para-list">
         <p>
            Headphone minijack with headset support
         </p>
      </div>
      <div class="para-list">
         <p>
            HDMI port supports multichannel audio output
         </p>
      </div>
      <div class="para-list as-pdp-lastparalist">
         <p>
            Built-in speaker
         </p>
      </div>
      <h4 class="h4-para-title">
         Wireless
      </h4>
      <div class="para-list">
         <p>
            802.11ac Wi-Fi wireless networking<sup>2</sup>
         </p>
      </div>
      <div class="para-list">
         <p>
            IEEE 802.11a/b/g/n compatible
         </p>
      </div>
      <div class="para-list as-pdp-lastparalist">
         <p>
            Bluetooth 4.0 wireless technology
         </p>
      </div>
      <h4 class="h4-para-title">
         Size and weight
      </h4>
      <div class="para-list">
         <p>
            Height: 9.9 inches (25.1 cm)
         </p>
      </div>
      <div class="para-list">
         <p>
            Diameter: 6.6 inches (16.7 cm)
         </p>
      </div>
      <div class="para-list as-pdp-lastparalist">
         <p>
            Weight: 11 pounds (5 kg)<sup>3</sup>
         </p>
      </div>
   </div>
</div>
I can find all the paragraphs like this:

section.findAll('p')

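But I don't know how to walk through it piece by piece so that each heading stays paired with the information that follows it, up to the next heading.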

You can use itertools.groupby to associate each run of consecutive div elements with the single h4 heading that precedes it:

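from bs4 import BeautifulSoup as soup
import re, itertools

# `content` is assumed to hold the HTML shown in the question.
d = soup(content, 'html.parser')
# Collect the h4 headings and the para-list divs in document order.
new_d = d.find_all(re.compile(r'h4|div'), {'class': re.compile(r'para-list|h4-para-title')})
# groupby yields alternating runs: [h4], [div, ...], [h4], [div, ...], ...
_r = [list(b) for a, b in itertools.groupby(new_d, key=lambda x: x.name == 'h4')]
# Pair each heading with the divs after it, stripping newlines and runs of whitespace.
final_result = {re.sub(r'\n|\s{2,}', '', _r[i][0].text): [re.sub(r'\n|\s{2,}', '', c.text) for c in _r[i+1]] for i in range(0, len(_r), 2)}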
Output:

{'Processor': ['3.7GHz Quad-Core Intel Xeon E5 processor with 10MB of L3 cache'], 
 'Memory': ['16GB (four 4GB) of 1866MHz DDR3 EEC'], 
 'Storage': ['512GB PCIe-based flash storage1'], 
 'Graphics': ['Dual AMD FirePro D300 graphics processors with 2GB of GDDR5 VRAM each'], 
 'Input/Output': ['Four USB 3 ports (up to 5 Gbps)', 'Six Thunderbolt 2 ports (up to 20 Gbps)', 'Dual Gigabit Ethernet ports', 'One HDMI 1.4 Ultra HD port'], 
 'Audio': ['Combined optical digital audio output/analog line out minijack', 'Headphone minijack with headset support', 'HDMI port supports multichannel audio output', 'Built-in speaker'], 
 'Wireless': ['802.11ac Wi-Fi wireless networking2', 'IEEE 802.11a/b/g/n compatible', 'Bluetooth 4.0 wireless technology'], 
 'Size and weight': ['Height: 9.9 inches (25.1 cm)', 'Diameter: 6.6 inches (16.7 cm)', 'Weight: 11 pounds (5 kg)3']
}
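
To make the grouping step easier to follow, here is a minimal sketch that uses only the standard library. The (tag name, text) tuples are hypothetical stand-ins for the tags that find_all returns; they are not part of the answer above. It shows how groupby splits a flat, ordered sequence into alternating runs of headings and blocks, and how those runs are then paired.

```python
import itertools

# Hypothetical stand-ins for the parsed tags: (tag name, text) pairs in document order.
tags = [
    ("h4", "Processor"),
    ("div", "3.7GHz Quad-Core Intel Xeon E5 processor with 10MB of L3 cache"),
    ("h4", "Input/Output"),
    ("div", "Four USB 3 ports (up to 5 Gbps)"),
    ("div", "Six Thunderbolt 2 ports (up to 20 Gbps)"),
]

# groupby starts a new run every time the key changes, so headings and
# div blocks come out as alternating runs.
runs = [list(group) for _, group in itertools.groupby(tags, key=lambda t: t[0] == "h4")]

# Even-indexed runs hold one heading; the run right after it holds that heading's divs.
specs = {
    runs[i][0][1]: [text for _, text in runs[i + 1]]
    for i in range(0, len(runs), 2)
}
print(specs)
```

This pairing assumes every heading is followed by at least one div. Two headings in a row would fall into the same run and shift the pairing, so real pages with empty categories would need an extra check.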
A solution that collects all the para-list elements following each category heading:

from collections import defaultdict

import requests
from bs4 import BeautifulSoup

url = "https://www.apple.com/shop/product/G0PK0LL/A/refurbished-mac-pro-37ghz-quad-core-intel-xeon-e5?fnode=3bb458bfb26c0f3137a9899791eba511037c7868c96bc9c236e2eeb016997c327ef9487f5e5bb8f13fb21e31d7a35da45e091125611c8e6c01a06b814f51596e3f2786b2d3f60262d4dd50a008f5acc8"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
section = soup.find("div", {"class": "as-productinfosection-panel TechSpecs-panel row"})

d = defaultdict(list)
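for cat in section.select(".h4-para-title"):
    k = cat.text.strip()
    # Walk the siblings after this heading and stop at the first one that is
    # not a para-list block (normally the next h4 heading).
    for item in cat.find_next_siblings():
        # .get avoids a KeyError on siblings that have no class attribute.
        if "para-list" not in item.get("class", []):
            break
        d[k].append(item.text.strip())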

print(dict(d))
Result:

{
    'Processor': ['3.7GHz Quad-Core Intel Xeon E5 processor with 10MB of L3 cache'],
    'Memory': ['16GB (four 4GB) of 1866MHz DDR3 EEC'],
    'Storage': ['512GB PCIe-based flash storage1'],
    'Graphics': ['Dual AMD FirePro D300 graphics processors with 2GB of GDDR5 VRAM each'],
    'Input/Output': ['Four USB 3 ports (up to 5 Gbps)', 'Six Thunderbolt 2 ports (up to 20 Gbps)', 'Dual Gigabit Ethernet ports', 'One HDMI 1.4 Ultra HD port'],
    'Audio': ['Combined optical digital audio output/analog line out minijack', 'Headphone minijack with headset support', 'HDMI port supports multichannel audio output', 'Built-in speaker'],
    'Wireless': ['802.11ac Wi-Fi wireless networking2', 'IEEE 802.11a/b/g/n compatible', 'Bluetooth 4.0 wireless technology'],
    'Size and weight': ['Height: 9.9 inches (25.1 cm)', 'Diameter: 6.6 inches (16.7 cm)', 'Weight: 11 pounds(5 kg)3']
}
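
Both results keep the footnote digits from the <sup> tags (for example 'flash storage1'), while the dictionary asked for in the question has none. One possible way to drop them, sketched here under the assumption that every <sup> in the section is a footnote marker, is to remove those tags before reading the text. The trimmed HTML string and the names html, heading, sibling, and specs are illustrative only; the sketch parses an inline copy of the markup so it does not depend on the live page.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question, so no network request is needed.
html = """
<div class="as-productinfosection-mainpanel column large-9 small-12">
  <h4 class="h4-para-title">Storage</h4>
  <div class="para-list as-pdp-lastparalist">
    <p>512GB PCIe-based flash storage<sup>1</sup></p>
  </div>
  <h4 class="h4-para-title">Wireless</h4>
  <div class="para-list">
    <p>802.11ac Wi-Fi wireless networking<sup>2</sup></p>
  </div>
  <div class="para-list as-pdp-lastparalist">
    <p>Bluetooth 4.0 wireless technology</p>
  </div>
</div>
"""

section = BeautifulSoup(html, "html.parser")

# Assumption: every <sup> here is a footnote marker, so remove them all
# from the tree before extracting any text.
for sup in section.find_all("sup"):
    sup.decompose()

specs = defaultdict(list)
for heading in section.select(".h4-para-title"):
    key = heading.get_text(strip=True)
    for sibling in heading.find_next_siblings():
        # Stop at the first sibling that is not a para-list block.
        if "para-list" not in sibling.get("class", []):
            break
        specs[key].append(sibling.get_text(strip=True))

print(dict(specs))
```

If a page used <sup> for meaningful content such as exponents, removing every one of them would lose information, so the selector would need to be narrowed for that case.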