Javascript Python: Unable to download with selenium in a web page


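My goal is to download a zip file; it is a link on this web page. Then save it to this directory:
"/home/vinvin/shKLSE/"
(I am using pythonanywhere). Then unzip it and extract the csv file in the directory.

The code runs to the end with no errors, but nothing is downloaded. When the link is clicked manually, the zip file downloads automatically.

My code uses a working username and password. The real username and password are used so that the problem is easier to understand.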

    #!/usr/bin/python
    print "hello from python 2"

    import urllib2
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import time
    from pyvirtualdisplay import Display
    import requests, zipfile, os    

    display = Display(visible=0, size=(800, 600))
    display.start()

    profile = webdriver.FirefoxProfile()
    profile.set_preference('browser.download.folderList', 2)
    profile.set_preference('browser.download.manager.showWhenStarting', False)
    profile.set_preference('browser.download.dir', "/home/vinvin/shKLSE/")
    profile.set_preference('browser.helperApps.neverAsk.saveToDisk', '/zip')

    for retry in range(5):
        try:
            browser = webdriver.Firefox(profile)
            print "firefox"
            break
        except:
            time.sleep(3)
    time.sleep(1)

    browser.get("https://www.shareinvestor.com/my")
    time.sleep(10)
    login_main = browser.find_element_by_xpath("//*[@href='/user/login.html']").click()
    print browser.current_url
    username = browser.find_element_by_id("sic_login_header_username")
    password = browser.find_element_by_id("sic_login_header_password")
    print "find id done"
    username.send_keys("bkcollection")
    password.send_keys("123456")
    print "log in done"
    login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
    login_attempt.submit()
    browser.get("https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa")
    print browser.current_url
    time.sleep(20)
    dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_all&market=bursa']").click()
    time.sleep(30)

    browser.close()
    browser.quit()
    display.stop()

    zip_ref = zipfile.ZipFile(/home/vinvin/sh/KLSE, 'r')
    zip_ref.extractall(/home/vinvin/sh/KLSE)
    zip_ref.close()
    os.remove(zip_ref)
HTML snippet:

<li><a href="/prices/price_download_zip_file.zip?type=history_all&amp;market=bursa">All Historical Data</a> <span>About 220 MB</span></li>
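  • Note that &amp; appears when the snippet is copied. It is hidden in the view-source display, so I guess it is written in JavaScript.

    My observations:

  • The directory home/vinvin/shKLSE is not created, even though the code runs without errors.

  • I tried downloading a much smaller zip file, which should finish within a second, but it still had not downloaded after waiting 30 seconds.
    dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_daily&date=20170519&market=bursa']").click()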


  • The cause is that the web page loads slowly. I added a 20-second wait after opening the page link:

    login_attempt.submit()
    browser.get("https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa")
    print browser.current_url
    time.sleep(20)
    dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_all&market=bursa']").click()
    
    It does not return an error.

    Also, /zip is not a correct MIME type. Change it to

    profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/zip')

    Final correction:

        #!/usr/bin/python
        print "hello from python 2"
    
        import urllib2
        from selenium import webdriver
        from selenium.webdriver.common.keys import Keys
        import time
        from pyvirtualdisplay import Display
        import requests, zipfile, os    
    
        display = Display(visible=0, size=(800, 600))
        display.start()
    
        profile = webdriver.FirefoxProfile()
        profile.set_preference('browser.download.folderList', 2)
        profile.set_preference('browser.download.manager.showWhenStarting', False)
        profile.set_preference('browser.download.dir', "/home/vinvin/shKLSE/")
        # application/zip not /zip
        profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/zip')
    
        for retry in range(5):
            try:
                browser = webdriver.Firefox(profile)
                print "firefox"
                break
            except:
                time.sleep(3)
        time.sleep(1)
    
        browser.get("https://www.shareinvestor.com/my")
        time.sleep(10)
        login_main = browser.find_element_by_xpath("//*[@href='/user/login.html']").click()
        print browser.current_url
        username = browser.find_element_by_id("sic_login_header_username")
        password = browser.find_element_by_id("sic_login_header_password")
        print "find id done"
        username.send_keys("bkcollection")
        password.send_keys("123456")
        print "log in done"
        login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
        login_attempt.submit()
        browser.get("https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa")
        print browser.current_url
        time.sleep(20)
        dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_all&market=bursa']").click()
        time.sleep(30)
    
        browser.close()
        browser.quit()
        display.stop()
    
        zip_ref = zipfile.ZipFile('/home/vinvin/shKLSE/file.zip', 'r')
        zip_ref.extractall('/home/vinvin/shKLSE')
        zip_ref.close()
        # remove with correct path
        os.remove('/home/vinvin/shKLSE/file.zip')
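
The corrected script hardcodes file.zip, which is only a placeholder: the real file name depends on what the site sends. Here is a minimal sketch, assuming the download directory holds only finished .zip files, that extracts every archive it finds and then deletes it. The name extract_and_remove is hypothetical, and the code is written for Python 3.

```python
import glob
import os
import zipfile


def extract_and_remove(download_dir):
    """Extract every .zip in download_dir into it, then delete the archive.

    Returns the names of all extracted members.
    """
    extracted = []
    for zip_path in sorted(glob.glob(os.path.join(download_dir, "*.zip"))):
        # The context manager closes the archive even if extraction fails
        with zipfile.ZipFile(zip_path, "r") as archive:
            archive.extractall(download_dir)
            extracted.extend(archive.namelist())
        # Remove by path; os.remove() does not accept a ZipFile object
        os.remove(zip_path)
    return extracted
```

Calling extract_and_remove("/home/vinvin/shKLSE/") after the browser has been closed would then replace the last four lines of the script.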
    

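    Take it out of selenium's scope. Change the preference settings so that clicking the link (check first that the link is valid) brings up a pop-up asking to save the file, then use sikuli to click through the pop-up.
    MIME types do not always work, and there is no black-and-white answer as to why they fail.

    I have not tried it on the site you mentioned, but the following code works fine and downloads a ZIP. If you cannot download the ZIP, the MIME type may be different. You can use the Chrome browser's network inspector to check the MIME type of the file you are trying to download.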
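
The MIME type check suggested here can also be done from a script instead of the network inspector. The sketch below is an illustration of that one step, not the answer's own code. It uses only the Python 3 standard library; content_type_of is a hypothetical helper, and it assumes the URL can be fetched without a login session (the ShareInvestor link needs one, so the session cookies would have to be sent as well).

```python
import urllib.request


def content_type_of(url, timeout=10):
    """Return the Content-Type the server reports for url, via a HEAD request."""
    # HEAD asks for the headers only, so the 220 MB body is not transferred
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # get_content_type() drops parameters such as "; charset=utf-8"
        return response.headers.get_content_type()
```

Whatever string this returns is the value to list in browser.helperApps.neverAsk.saveToDisk.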


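    I rewrote your script, with comments explaining why I made the changes. I think your main problem may have been a bad MIME type; however, your script had a lot of systemic issues that would make it unreliable at best. This rewrite uses explicit waits, which completely removes the need for time.sleep(), allowing it to run as fast as possible while also eliminating errors caused by network congestion.

    You will need to run the following to make sure all the modules are installed:

    pip install requests explicit selenium retry pyvirtualdisplay

    The script: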

    #!/usr/bin/python
    
    from __future__ import print_function  # Makes your code portable
    
    import os
    import glob
    import zipfile
    from contextlib import contextmanager
    
    import requests
    from retry import retry
    from explicit import waiter, XPATH, ID
    from selenium import webdriver
    from pyvirtualdisplay import Display
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.wait import WebDriverWait
    
    DOWNLOAD_DIR = "/tmp/shKLSE/"
    
    
    def build_profile():
        profile = webdriver.FirefoxProfile()
        profile.set_preference('browser.download.folderList', 2)
        profile.set_preference('browser.download.manager.showWhenStarting', False)
        profile.set_preference('browser.download.dir', DOWNLOAD_DIR)
        # I think your `/zip` mime type was incorrect. This works for me
        profile.set_preference('browser.helperApps.neverAsk.saveToDisk',
                               'application/vnd.ms-excel,application/zip')
    
        return profile
    
    
    # Retry is an elegant way to retry the browser creation
    # Though you should narrow the scope to whatever the actual exception is you are
    # retrying on
    @retry(Exception, tries=5, delay=3)
    @contextmanager  # This turns get_browser into a context manager
    def get_browser():
        # Use a context manager with Display, so it will be closed even if an
        # exception is thrown
        profile = build_profile()
        with Display(visible=0, size=(800, 600)):
            browser = webdriver.Firefox(profile)
            print("firefox")
            try:
                yield browser
            finally:
                # Let a try/finally block manage closing the browser, even if an
                # exception is called
                browser.quit()
    
    
    def main():
        print("hello from python 2")
        with get_browser() as browser:
            browser.get("https://www.shareinvestor.com/my")
    
            # Click the login button
            # waiter is a helper function that makes it easy to use explicit waits
            # with it you don't need to use time.sleep() calls at all
            login_xpath = '//*/div[@class="sic_logIn-bg"]/a'
            waiter.find_element(browser, login_xpath, XPATH).click()
            print(browser.current_url)
    
            # Log in
            username = "bkcollection"
            username_id = "sic_login_header_username"
            password = "123456"
            password_id = "sic_login_header_password"
            waiter.find_write(browser, username_id, username, by=ID)
            waiter.find_write(browser, password_id, password, by=ID, send_enter=True)
    
            # Wait for login process to finish by locating an element only found
            # after logging in, like the Logged In Nav
            nav_id = 'sic_loggedInNav'
            waiter.find_element(browser, nav_id, ID)
    
            print("log in done")
    
            # Load the target page
            target_url = ("https://www.shareinvestor.com/prices/price_download.html#/?"
                          "type=price_download_all_stocks_bursa")
            browser.get(target_url)
            print(browser.current_url)
    
            # Click download button
            all_data_xpath = ("//*[@href='/prices/price_download_zip_file.zip?"
                              "type=history_all&market=bursa']")
            waiter.find_element(browser, all_data_xpath, XPATH).click()
    
            # This is a bit challenging: You need to wait until the download is complete
            # This file is 220 MB, it takes a while to complete. This method waits until
            # there is at least one file in the dir, then waits until there are no
            # filenames that end in `.part`
            # Note that this is problematic if there is already a file in the target dir. I
            # suggest looking into using the tempfile module to create a unique, temporary
            # directory for downloading every time you run your script
            print("Waiting for download to complete")
            at_least_1 = lambda x: len(x("{0}/*.zip*".format(DOWNLOAD_DIR))) > 0
            WebDriverWait(glob.glob, 300).until(at_least_1)
    
            no_parts = lambda x: len(x("{0}/*.part".format(DOWNLOAD_DIR))) == 0
            WebDriverWait(glob.glob, 300).until(no_parts)
    
            print("Download Done")
    
            # Now do whatever it is you need to do with the zip file
            # zip_ref = zipfile.ZipFile(DOWNLOAD_DIR, 'r')
            # zip_ref.extractall(DOWNLOAD_DIR)
            # zip_ref.close()
            # os.remove(zip_ref)
    
            print("Done!")
    
    
    if __name__ == "__main__":
        main()
    

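    Full disclosure: I maintain the explicit module. It is designed to make explicit waits easier to use, especially in cases like this one, where a site slowly loads dynamic content in response to user interaction. You could replace all of the waiter.XXX calls above with direct explicit waits.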
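
The download-completion wait in the script can also be written without WebDriverWait. The following is a minimal sketch using only the Python 3 standard library; wait_for_download and the shKLSE_ prefix are hypothetical names. It relies on the same observation as the script's comments: Firefox keeps a .part file in the directory while a download is still in progress.

```python
import glob
import os
import tempfile
import time


def wait_for_download(download_dir, timeout=300, poll=0.5):
    """Block until download_dir holds a .zip file and no .part files.

    Returns the sorted list of .zip paths; raises RuntimeError on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        zips = sorted(glob.glob(os.path.join(download_dir, "*.zip")))
        parts = glob.glob(os.path.join(download_dir, "*.part"))
        if zips and not parts:
            return zips
        if time.monotonic() >= deadline:
            raise RuntimeError("download did not finish in %s seconds" % timeout)
        time.sleep(poll)


# A fresh directory per run, so files left over from earlier runs
# cannot satisfy the check by accident
download_dir = tempfile.mkdtemp(prefix="shKLSE_")
```

The directory created by tempfile.mkdtemp() would be passed to browser.download.dir in place of the fixed DOWNLOAD_DIR.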

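    I don't see any major drawback in your code block as such. Here are a few recommendations from working through this solution and running this automated test script: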

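  • This code works perfectly during off-market hours. During market hours a lot of JavaScript and Ajax calls are in play, and handling those is beyond the scope of this question.
  • You may consider first checking for the intended download directory and creating it if it does not exist. The code block for this step is written in Windows style and works perfectly on the Windows platform.
  • After clicking "Login", wait for the HTML DOM to render properly.
  • To get the download to complete, you need to set a few more preferences in FirefoxProfile, as shown in my code below.
  • Always consider maximizing the browser window with browser.maximize_window().
  • Once the download starts, you need to wait long enough for the file to download completely.
  • If you call browser.quit() at the end, you do not need browser.close().
  • You may consider replacing all of the time.sleep() calls.
  • Here is your own code block with some simple tweaks: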

    #!/usr/bin/python
    print "hello from python 2"
    
    import urllib2
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import time
    from pyvirtualdisplay import Display
    import requests, zipfile, os    
    
    display = Display(visible=0, size=(800, 600))
    display.start()
    
    newpath = 'C:\\home\\vivvin\\shKLSE'
    if not os.path.exists(newpath):
        os.makedirs(newpath)    
    
    profile = webdriver.FirefoxProfile()
    profile.set_preference("browser.download.dir", newpath)
    profile.set_preference("browser.download.folderList", 2)
    profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/zip")
    profile.set_preference("browser.download.manager.showWhenStarting", False)
    profile.set_preference("browser.helperApps.neverAsk.openFile", "application/zip")
    profile.set_preference("browser.helperApps.alwaysAsk.force", False)
    profile.set_preference("browser.download.manager.useWindow", False)
    profile.set_preference("browser.download.manager.focusWhenStarting", False)
    profile.set_preference("browser.helperApps.neverAsk.openFile", "")
    profile.set_preference("browser.download.manager.alertOnEXEOpen", False)
    profile.set_preference("browser.download.manager.showAlertOnComplete", False)
    profile.set_preference("browser.download.manager.closeWhenDone", True)
    profile.set_preference("pdfjs.disabled", True)
    
    for retry in range(5):
        try:
            browser = webdriver.Firefox(profile)
            print "firefox"
            break
        except:
            time.sleep(3)
    time.sleep(1)
    
    browser.maximize_window()
    browser.get("https://www.shareinvestor.com/my")
    time.sleep(10)
    login_main = browser.find_element_by_xpath("//*[@href='/user/login.html']").click()
    time.sleep(10)
    print browser.current_url
    username = browser.find_element_by_id("sic_login_header_username")
    password = browser.find_element_by_id("sic_login_header_password")
    print "find id done"
    username.send_keys("bkcollection")
    password.send_keys("123456")
    print "log in done"
    login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
    login_attempt.submit()
    browser.get("https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa")
    print browser.current_url
    time.sleep(20)
    dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_all&market=bursa']").click()
    time.sleep(900)
    
    browser.close()
    browser.quit()
    display.stop()
    
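    # NOTE: 'file.zip' is a placeholder; use the actual name of the downloaded file
    zip_path = os.path.join(newpath, 'file.zip')
    zip_ref = zipfile.ZipFile(zip_path, 'r')
    zip_ref.extractall(newpath)
    zip_ref.close()
    os.remove(zip_path)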
    

  • Let me know if this answers your question.

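    As mentioned, there is no pop-up. You can try it manually with the username and password.
    Would a solution on the Windows platform work for you? Thanks
    @Dev That is acceptable, as long as it works consistently.
    I see that you use folderList, show, download.dir & neverAsk.saveToDisk at startup, but you did not mention them in your description. Do you have any requirements based on these features? Thanks
    As long as the zip is downloaded into the directory and unzipped, that is fine.
    Can you explain @retry(Exception, tries=5, delay=3) and @contextmanager? Are there any other general modules that can replace retry and explicit? PythonAnywhere does not have these two modules.
    @bkcollection, before I update my answer, can you first tell me whether you can use this to install external dependencies, in particular retry and explicit? https://help.pythonanywhere.com/pages/InstallingNewModules/
    Firefox basically works, but it often gets stuck after print "firefox" and has to be re-run.
    Two of the three details are more suitable for beginners. How would you suggest replacing them with implicit wait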
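
One commenter asked whether retry and explicit can be replaced with something general, since those modules were not available on PythonAnywhere. A minimal sketch using only the Python 3 standard library follows. The retry decorator mirrors the call signature used in the rewritten script, @retry(Exception, tries=5, delay=3), but it is a stand-in written for illustration, not the implementation of the retry package; wait_until is a hypothetical polling helper, not part of Selenium or explicit.

```python
import functools
import time


def retry(exceptions=Exception, tries=5, delay=3):
    """Decorator: call the function up to `tries` times, sleeping between failures."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise  # out of attempts: re-raise the last error
                    time.sleep(delay)
        return wrapper
    return decorator


def wait_until(condition, timeout=30, poll=0.5):
    """Poll condition() until it returns a truthy value, then return that value."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise RuntimeError("condition not met within %s seconds" % timeout)
        time.sleep(poll)
```

For example, wait_until(lambda: browser.find_elements_by_id("sic_loggedInNav")) would block until the logged-in navigation element exists, because find_elements_by_id returns an empty list until then. Unlike an explicit wait, this helper does not swallow exceptions raised by the condition.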