Python: How to download ".idx" files using Selenium (what is the MIME type of idx?)
Tags: python, firefox, selenium, selenium-webdriver, mime-types

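I am trying to use Selenium to download the following files:

These are just plain text files, but their unusual extension is giving me a real headache. The browser always invokes some plugin to read the file, and I don't know which MIME type "idx" corresponds to.

After searching all over the web, I figured a simple approach would be to set up a Firefox profile like this: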

profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2)
profile.set_preference('browser.download.dir', cachedir)
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/pdf, text/plain, application/vnd.idx, application/xml, application/octet-stream, text/html, application/vnd.oasis.opendocument.text-web, application/rtf, text/richtext, application/xhtml+xml')
profile.set_preference('plugin.disable_full_page_plugin_for_types', 'application/pdf, text/plain, application/vnd.idx, application/xml, application/octet-stream, text/html, application/vnd.oasis.opendocument.text-web, application/rtf, text/richtext, application/xhtml+xml')
profile.set_preference('browser.helperApps.alwaysAsk.force', False)
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('pdfjs.disabled', True)
return webdriver.Firefox(profile)
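I tried putting almost every type I could think of into the "browser.helperApps.neverAsk.saveToDisk" and "plugin.disable_full_page_plugin_for_types" preferences, but none of them did the trick.

Does anyone know which MIME type to use here? Or, more generally, how can we find out the MIME type of an arbitrary file (note that some file extensions are non-standard)?

My full code is below: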

from bs4 import BeautifulSoup
import time
import os
from selenium import webdriver
from selenium.webdriver.common.alert import Alert
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException

def get_browser(cachedir):
    profile = webdriver.FirefoxProfile()
    profile.set_preference('browser.download.folderList', 2)
    profile.set_preference('browser.download.dir', cachedir)
    profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/pdf, text/plain, application/vnd.idx, application/xml, application/octet-stream, text/html, application/vnd.oasis.opendocument.text-web, application/rtf, text/richtext, application/xhtml+xml, text/x-mail')
    profile.set_preference('plugin.disable_full_page_plugin_for_types', 'application/pdf, text/plain, application/vnd.idx, application/xml, application/octet-stream, text/html, application/vnd.oasis.opendocument.text-web, application/rtf, text/richtext, application/xhtml+xml, text/x-mail')
    profile.set_preference('browser.helperApps.alwaysAsk.force', False)
    profile.set_preference('browser.download.manager.showWhenStarting', False)
    profile.set_preference('pdfjs.disabled', True)
    return webdriver.Firefox(profile)

def write_content(page_source, file_path):
    soup = BeautifulSoup(page_source)
    form_content = soup.find_all("body")[0].text

    print("getting {}".format(file_path))

    with open(file_path, "w") as f_out:
        f_out.write(form_content.encode('utf-8'))

cachedir = "/Users/voiceup/Desktop"
form_dir = "forms/"
browser = get_browser(cachedir)
for year in range(1993, 2015):
    for qtr in range(1, 5):
        year = str(year)
        qtr = str(qtr)
        url = "ftp://ftp.sec.gov/edgar/full-index/" + year + "/QTR" + qtr + "/form.idx"
        browser.get(url)

        # alert means there is broken file
        # refresh the browser until there is no alert
        has_alert = True
        while has_alert:
            try: 
                WebDriverWait(browser, 2).until(EC.alert_is_present())
                alert = browser.switch_to_alert()
                alert.accept()
                print("alert accepted")
                browser.refresh()
            except TimeoutException:
                has_alert = False

        # manually download the file
        file_name = year + "_" + qtr + ".txt"
        file_path = os.path.join(form_dir, file_name)
        write_content(browser.page_source, file_path)


time.sleep(2)
browser.quit()

Thanks.
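
On the general question of looking up a MIME type for a file name, here is a minimal sketch using Python's standard mimetypes module. It only consults the local extension table, so it says nothing about what Firefox or a server would report; the helper name describe and the choice to treat ".idx" as text/plain are assumptions made for illustration.

```python
import mimetypes


def describe(filename):
    """Return the MIME type registered for the file's extension, or None."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


# A well-known extension resolves from the built-in table.
print(describe("report.pdf"))

# A non-standard extension may come back as None if nothing is registered
# for it on this machine.
print(describe("form.idx"))

# Assumption for illustration: register ".idx" as plain text locally.
mimetypes.add_type("text/plain", ".idx")
print(describe("form.idx"))
```

After the add_type call, lookups for ".idx" in the same process return the registered type.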

Selenium is definitely not the right tool for this job; it adds a huge amount of overhead to the problem.

In this case, ftplib fits perfectly:

import os
import ftplib

form_dir = "forms/"

ftp = ftplib.FTP('ftp.sec.gov', 'anonymous')

for year in range(1993, 2015):
    for qtr in range(1, 5):
        url = "edgar/full-index/{year}/QTR{qtr}/form.idx".format(year=year, qtr=qtr)
        filename = "{year}_{qtr}.txt".format(year=year, qtr=qtr)

        print("Process URL: " + url)

        # manually download the file
        with open(os.path.join(form_dir, filename), "wb") as file:
            ftp.retrbinary("RETR " + url, file.write)

ftp.close()
When you run the script, you will see files being created in the forms/ directory, and the following will be printed to the console:

Process URL: edgar/full-index/1993/QTR1/form.idx
Process URL: edgar/full-index/1993/QTR2/form.idx
Process URL: edgar/full-index/1993/QTR3/form.idx
Process URL: edgar/full-index/1993/QTR4/form.idx
Process URL: edgar/full-index/1994/QTR1/form.idx
...
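
A slightly more defensive variant of the script above can be sketched as follows. This is only a sketch under stated assumptions: index_paths and download_all are hypothetical helper names that are not part of the original answer, and skipping a file when the server raises ftplib.error_perm is an illustrative choice. Because retrbinary hands data to the callback block by block, each file is written to disk as it arrives rather than being held in memory.

```python
import ftplib
import os


def index_paths(first_year, last_year):
    """Yield (remote_path, local_name) for every quarter in the range."""
    for year in range(first_year, last_year + 1):
        for qtr in range(1, 5):
            remote = "edgar/full-index/{0}/QTR{1}/form.idx".format(year, qtr)
            local = "{0}_{1}.txt".format(year, qtr)
            yield remote, local


def download_all(ftp, form_dir, first_year, last_year):
    """Download each index file; return the remote paths that failed."""
    # Create the output directory if it does not exist yet.
    if not os.path.isdir(form_dir):
        os.makedirs(form_dir)

    failed = []
    for remote, local in index_paths(first_year, last_year):
        target = os.path.join(form_dir, local)
        try:
            with open(target, "wb") as out:
                # retrbinary calls out.write once per received block.
                ftp.retrbinary("RETR " + remote, out.write)
        except ftplib.error_perm:
            # The server refused the request (for example, no such file):
            # drop the empty local file and remember the path.
            os.remove(target)
            failed.append(remote)
    return failed
```

Assuming the server accepts anonymous logins as in the answer, it could be called as download_all(ftplib.FTP('ftp.sec.gov', 'anonymous'), 'forms/', 1993, 2014), which returns the list of remote paths that could not be retrieved.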

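Thanks for your reply. I tried saving the page source manually, but that didn't fully work because some of these files are large. For recent years, the "form.idx" files are around 100 MB each, which causes load timeouts and corrupted files.

@xjmfel OK, good point. Let's see what other options there are. Why do you need Selenium here? Thanks.

There are hundreds of files, so downloading them by hand isn't easy. Is there another way to do this? I'm using Selenium because it's the only way I know to download files automatically. Thanks.

@xjmfel There are plenty of options. Can you provide the code you have so far? I need to see where you start and what needs to be downloaded so that I can play with it. Thanks.

Thanks a lot! I've attached the full code to the original post. I think the logic is fairly simple. I defined cachedir but never really used it (because of the problem above). If anything in the code is confusing, please ask.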