Get URLs by div using Python and BeautifulSoup


I'm trying to scrape PDFs from a website using BeautifulSoup and Selenium.

I've tried using the find_all function in different ways, but haven't been able to get the results I need.

Basically, what I'd like to be able to do is get the links for each quarter, e.g. Q4 2014 – Q3 2015, and for each country, e.g. Malaysia, Indonesia, etc., so that I can scrape the PDFs by quarter into a folder and then create subfolders for the countries inside it.

Here is an HTML snippet from the site:

</div><a class="accord-header accord-header-5049 accord-header-supply-chain-resources"><div>Supply Chain</div></a><div class="accord-body accord-body-5049 accord-body-supply-chain-resources" style="display: none;"><ul>
<li class="folder">
                    <div>Q4 2014 – Q3 2015</div>
                    <ul style="display: none;">
                        <li class="folder">
                            <div>Indonesia</div>
                            <ul style="display: none;">
                                <li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Indonesia/MNA KualaTanjung_L1--160122.pdf">MNA KualaTanjung</a></li>
                                <li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Indonesia/MNA Paya Pasir_L1 --160122.pdf">MNA Paya Pasir</a></li>
                                <li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Indonesia/MNA Pulo Gadung_L1 --160122.pdf">MNA Pulo Gadung</a></li>
                            </ul>
                        </li>
                        <li class="folder">
                            <div>Malaysia</div>
                            <ul style="display: none;">
                                <li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Malaysia/BEO_L1 -- 160122.pdf">BEO Bintulu</a></li>
                                <li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Malaysia/LDEO_L1 -- 160122.pdf">LDEO Lahad Datu</a></li>

                            </ul>
                        </li>
                        <li class="folder">
                            <div>Destination Countries</div>
                            <ul style="display: none;">
                                <li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Destination/Bangladesh_160122 -- new.pdf">Bangladesh</a></li>
                                <li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Destination/China- Oleochemical_160122 -- new.pdf">China- Oleochemical</a></li>
                                <li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Destination/China- Specialty Fats_160122 -- new.pdf">China- Specialty Fats</a></li>
                                <li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Destination/Europe_Brake -- 160122 -- new.pdf">Europe_Brake</a></li>
                                <li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Destination/Europe_Rotterdam -- 160122 -- new.pdf">Europe_Rotterdam</a></li>

                            </ul>
                        </li>
                    </ul>
                </li>
                <li class="folder">
                    <div>Q1 – Q4 2015</div>
                    <ul style="display: none;">
                        <li class="folder">
                            <div>Indonesia</div>
                            <ul style="display: none;">
                                <li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_MNA-KTJ_L1.pdf">MNA KualaTanjung</a></li>
                                <li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_MNA-PG_L1.pdf">MNA Paya Pasir</a></li>
                                <li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_MNA-PPS_L1.pdf">MNA Pulo Gadung</a></li>
                                <li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_MNS-BTG_L1.pdf">MNS Bitung</a></li>

                            </ul>
                        </li>
                        <li class="folder">
                            <div>Malaysia</div>
                            <ul style="display: none;">
                            <li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_BEO_L1.pdf">BEO Bintulu</a></li>
                            <li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_LDEO_L1.pdf">LDEO Lahad Datu</a></li>
                            <li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_NatOleo_L1.pdf">NatOleo Pasir Gudang</a></li>
                            <li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_PGEO-Lumut_L1.pdf">PGEO Lumut</a></li>
                            </ul>
                        </li>
                        <li class="folder">
                            <div>Destination Countries</div>
                            <ul style="display: none;">
                                <li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_Bangladesh_L1.pdf">Bangladesh</a></li>
                                <li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_China-Oleochemical_L1.pdf">China- Oleochemical</a></li>
                                <li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_China-Specialty Fats_L1.pdf">China- Specialty Fats</a></li>

                            </ul>
                        </li>
                    </ul>
                </li>

The page loads its content with JavaScript, so I simply use Selenium to load the page and grab the HTML. I've also modified the code to target only the Supply Chain section.

Edit:

This new version keeps the same browser open, downloads the PDFs to the download directory set near the top, and moves them into the correct directory structure. The directory tree will be created wherever you run this script.

Since the site seems to employ anti-bot measures, I randomized the timing; the 3–9 second sleep is easy to change. Also, if the script stops for any reason, you should be able to resume downloading where it left off: the code checks whether a file already exists in the correct directory and only downloads it if it doesn't.

To save time (there appear to be 525 PDFs in total), I only downloaded the PDFs from the first quarter directory for testing, but let me know if you hit any errors.
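The script below builds each download URL with `urljoin` and percent-encodes the spaces in the file names with `quote`. A quick sketch of that step, using a path taken from the snippet above:

```python
from urllib.parse import quote, urljoin

base = 'http://www.wilmar-international.com/sustainability/resource-library/'
href = '/sustainability/wp-content/uploads/2016/09/160427_China-Specialty Fats_L1.pdf'

# quote() leaves '/' alone by default but encodes the space as %20;
# urljoin() resolves the absolute path against the page's scheme and host.
full = urljoin(base, quote(href))
print(full)
# http://www.wilmar-international.com/sustainability/wp-content/uploads/2016/09/160427_China-Specialty%20Fats_L1.pdf
```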

import os
import random
import shutil
import time
from collections import defaultdict
from urllib.parse import quote, urljoin

from bs4 import BeautifulSoup
from bs4.element import Tag
from selenium import webdriver

# Setup Chrome to download PDFs
download_dir = '/home/lettuce/Downloads'  # on Windows, e.g. r"D:\z_Temp\Wilmar_Traceability"
options = webdriver.ChromeOptions()
profile = {
    "plugins.plugins_list": [{
        "enabled": False,
        "name": "Chrome PDF Viewer"
    }],
    # Disable Chrome's PDF Viewer
    "download.default_directory": download_dir,
    "download.extensions_to_open": "applications/pdf"
}
options.add_experimental_option("prefs", profile)
driver = webdriver.Chrome(options=options)  # 'chrome_options' is deprecated; use 'options'

# Get page source of all PDF links
url = 'http://www.wilmar-international.com/sustainability/resource-library/'
driver.get(url)
page_html = driver.page_source

# Parse out PDF links and a structure for the folders
soup = BeautifulSoup(page_html, 'lxml')
supply_chain = soup.select_one(
    '#text-wrap-sub > div.sub_cont_left > div > div > div > '
    'div.accord-body.accord-body-5049.accord-body-supply-chain-resources > ul'
)
result = {}
for li in supply_chain:
    if isinstance(li, Tag):
        quarter = li.div.text
        documents = defaultdict(list)
        for folder in li.find_all('li', class_='folder'):
            country = folder.div.text
            for document in folder.find_all('li', class_="document"):
                documents[country].append(document.a['href'])
        result[quarter] = documents


supply_chain_dir = os.path.join(os.getcwd(), 'SupplyChain')
os.makedirs(supply_chain_dir, exist_ok=True)
for quarter, countries in result.items():
    # create quarter directory
    quarter_dir = os.path.join(supply_chain_dir, quarter)
    os.makedirs(quarter_dir, exist_ok=True)
    for country, documents in countries.items():
        # create country directory
        country_dir = os.path.join(quarter_dir, country)
        os.makedirs(country_dir, exist_ok=True)
        for document in documents:
            filename = document.split('/')[-1]
            if not os.path.exists(os.path.join(country_dir, filename)):
                # download pdf and move it to country directory
                driver.get(urljoin(url, quote(document)))
                time.sleep(random.randint(3, 9))
                shutil.move(
                    src=os.path.join(download_dir, filename),
                    dst=country_dir
                )

driver.quit()

Thanks so much for this script, @crazy lettuce! The output looks great, but unfortunately when I run it against the whole web page I get this error: AttributeError: 'NoneType' object has no attribute 'text'. The whole page I'm trying to get the information from is here; I'm only interested in the Supply Chain section. Thanks again for your help!

Thanks a lot, @crazy lettuce! It works great. I know I should have asked this in the original question, but I was just wondering whether you know how I could use Selenium to quickly create the folders from the dictionary and run through the links to download the PDFs?
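The AttributeError mentioned in the comments comes from calling `.text` on a `None` result, e.g. when a selector finds no match or an `<li>` has no `<div>` child. A small sketch of the defensive pattern (the markup here is illustrative):

```python
from bs4 import BeautifulSoup

html = '<ul><li class="folder">no div here</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

li = soup.find('li', class_='folder')
# li.div is None when the <li> has no <div> child, so guard before .text
label = li.div.text if li.div is not None else '(unnamed)'
print(label)
# (unnamed)
```

The same guard applies to `select_one`, which also returns `None` when the CSS selector matches nothing on the page.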