How to download a file using Python in a "smarter" way?


I need to download several files over HTTP in Python.

The most obvious way is to use urllib2:

import urllib2
u = urllib2.urlopen('http://server.com/file.html')
localFile = open('file.html', 'w')
localFile.write(u.read())
localFile.close()

But I have to deal with some nasty URLs, such as http://server.com/!Run.aspx/someoddtext/somemore?id=121&m=pdf. When downloaded through a browser, the file has a human-readable name, namely accounts.pdf.


Is there a way to handle this in Python, so that I don't need to know the file names and hardcode them into my script?

Download scripts like that tend to push a header that tells the user agent what to name the file:

Content-Disposition: attachment; filename="the filename.ext"

If you can grab that header, you can get the proper file name.

Another thread offers a little code for grabbing Content-Disposition:

remotefile = urllib2.urlopen('http://example.com/somefile.zip')
remotefile.info()['Content-Disposition']
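
On Python 3, where urllib2 became urllib.request, the response headers are an email.message.Message subclass, so the standard get_filename() method can do the parsing for you. A minimal sketch, assuming a hand-built header object stands in for a real response; the function name filename_from_headers and the header value are made up for illustration:

```python
from email.message import Message


def filename_from_headers(headers):
    """Return the filename announced in Content-Disposition, or None."""
    # get_filename() reads the filename parameter of Content-Disposition
    # and removes the surrounding quotes.
    return headers.get_filename()


# Stand-in for urllib.request.urlopen(url).headers (illustrative value).
headers = Message()
headers["Content-Disposition"] = 'attachment; filename="accounts.pdf"'
print(filename_from_headers(headers))
```

With a live response you would pass response.headers instead of the hand-built object.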

Based on the comments and @Oli's answer, I came up with a solution like this:

import urllib2
from os.path import basename
from urlparse import urlsplit

def url2name(url):
    return basename(urlsplit(url)[2])

def download(url, localFileName = None):
    localName = url2name(url)
    req = urllib2.Request(url)
    r = urllib2.urlopen(req)
    if r.info().has_key('Content-Disposition'):
        # If the response has Content-Disposition, we take file name from it
        localName = r.info()['Content-Disposition'].split('filename=')[1]
        if localName[0] == '"' or localName[0] == "'":
            localName = localName[1:-1]
    elif r.url != url: 
        # if we were redirected, the real file name we take from the final URL
        localName = url2name(r.url)
    if localFileName: 
        # we can force to save the file as specified name
        localName = localFileName
    f = open(localName, 'wb')
    f.write(r.read())
    f.close()

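It takes the file name from Content-Disposition; if that header is not present, it uses the file name from the URL. If a redirect happened, the final URL is used.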
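
The priority order can be pulled out into a small pure function, which makes it easy to check without touching the network. This is only a sketch in Python 3; choose_name and its parameter names are hypothetical, not part of the answer above:

```python
from os.path import basename
from urllib.parse import urlsplit


def choose_name(final_url, disposition_name=None, forced_name=None):
    """Pick a local file name (hypothetical helper, illustrative only)."""
    if forced_name:
        # The caller's explicit choice wins.
        return forced_name
    if disposition_name:
        # Next comes the name announced by the server.
        return disposition_name
    # Otherwise fall back to the path of the final URL, which differs
    # from the requested one when a redirect happened.
    return basename(urlsplit(final_url).path)


print(choose_name("http://server.com/files/accounts.pdf"))
```

Here final_url would be the response's url attribute and disposition_name whatever was parsed out of the header.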

Combining most of the above, here is a more pythonic solution:

import urllib2
import shutil
import urlparse
import os

def download(url, fileName=None):
    def getFileName(url,openUrl):
        if 'Content-Disposition' in openUrl.info():
            # If the response has Content-Disposition, try to get filename from it
            cd = dict(map(
                # split on the first '=' only, so a value containing '=' still yields a pair
                lambda x: x.strip().split('=', 1) if '=' in x else (x.strip(),''),
                openUrl.info()['Content-Disposition'].split(';')))
            if 'filename' in cd:
                filename = cd['filename'].strip("\"'")
                if filename: return filename
        # if no filename was found above, parse it out of the final URL.
        return os.path.basename(urlparse.urlsplit(openUrl.url)[2])

    r = urllib2.urlopen(urllib2.Request(url))
    try:
        fileName = fileName or getFileName(url,r)
        with open(fileName, 'wb') as f:
            shutil.copyfileobj(r,f)
    finally:
        r.close()
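
For anyone on Python 3, where urllib2 no longer exists, roughly the same idea might look like the sketch below. It is my own port, not the answerer's code, and it assumes get_filename() on the response headers is good enough for parsing Content-Disposition:

```python
import os
import shutil
import urllib.request
from urllib.parse import urlsplit


def download(url, file_name=None):
    """Save url to disk and return the local name that was used."""
    with urllib.request.urlopen(url) as response:
        # Explicit name first, then Content-Disposition, then the final URL.
        name = (file_name
                or response.headers.get_filename()
                or os.path.basename(urlsplit(response.url).path))
        with open(name, "wb") as out:
            # Copy in chunks instead of reading the whole body into memory.
            shutil.copyfileobj(response, out)
    return name
```

The name taken from the header or URL is still untrusted and may be empty, so in real use it should be cleaned up before it reaches open().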
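To Kender:

This is not safe: a web server may send a malformed name such as ["file.ext] or [file.ext'], or even an empty one. In the first two cases the slice localName[1:-1] cuts off a real character, and with an empty name localName[0] raises an exception. Correct code could look like this: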

localName = localName.replace('"', '').replace("'", "")
if localName == '':
    localName = SOME_DEFAULT_FILE_NAME
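
Wrapped up as a function, the clean-up could go one step further and also drop any directory part, since the header value comes from the server and should not decide where the file lands. A sketch in Python 3; safe_name and DEFAULT_NAME are hypothetical names chosen for this example:

```python
import os

DEFAULT_NAME = "download.bin"  # hypothetical fallback, pick what suits you


def safe_name(raw, default=DEFAULT_NAME):
    """Turn an untrusted file name from a header into a usable local name."""
    # Strip whitespace, then any quotes at either end (balanced or not).
    name = (raw or "").strip().strip("\"'")
    # Keep only the last path component so a value like ../../x cannot
    # escape the download directory; treat backslashes as separators too.
    name = os.path.basename(name.replace("\\", "/"))
    if name in ("", ".", ".."):
        return default
    return name


print(safe_name('"file.ext'))
```

Unlike replace(), strip() leaves quotes in the middle of a name alone, which may or may not be what you want.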
Using wget:

Using urlretrieve:

urllib.urlretrieve(url, custom_file_name)

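Note that urlretrieve opens custom_file_name for writing directly, so the target directory has to exist already; it does not create missing directories.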
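
In Python 3 the same helper lives at urllib.request.urlretrieve (the docs describe it as a legacy interface). As far as I can tell it writes to the target path with a plain open(), so this sketch creates the parent directory itself before downloading. retrieve_to is a hypothetical wrapper name:

```python
import os
import urllib.request


def retrieve_to(url, target_path):
    """Download url to target_path, creating parent directories first."""
    parent = os.path.dirname(target_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # urlretrieve returns a (local_filename, headers) tuple.
    local_path, _headers = urllib.request.urlretrieve(url, target_path)
    return local_path
```

A call such as retrieve_to(url, "/custom/path/custom_name.ext") would then work even when /custom/path does not exist yet (the path is only an example).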

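Is the file name on the server relevant? Presumably the files mean something to you, so you should be able to name them yourself. If the server's names are meaningless, you could just pick unique names of your own.

I want the file names to be readable and meaningful. The catch is that the script will read the URLs from a text file, and non-technical people will add and remove URLs there.

They might redirect to plain files. But if it works like most download scripts, it pushes Content-Disposition. Be sure to check.

If it redirects me to a plain file, that is easy too: I can get the actual URL through remotefile.url, can't I?

I found this useful. But to download larger files without holding their entire contents in memory, I had to look up how to copy your "r" to "f": import shutil; shutil.copyfileobj(r, f)

Works nicely, but I would wrap urlsplit(url)[2] in a call to urllib.unquote, otherwise the file names stay percent-encoded. Here is how I do it: return basename(urllib.unquote(urlsplit(url)[2]))

Even better: local_name.strip('\'"'). That strips only from the start and the end, and it is more concise.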
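
The percent-decoding suggestion translates to Python 3 roughly as follows, with unquote now living in urllib.parse. This is a small sketch and the example URL is made up:

```python
from os.path import basename
from urllib.parse import unquote, urlsplit


def url2name(url):
    """Return the decoded last path segment of url."""
    # Decode first, then take the basename, so an encoded slash (%2F)
    # cannot smuggle a directory part into the result.
    return basename(unquote(urlsplit(url).path))


print(url2name("http://server.com/files/annual%20report.pdf?id=121"))
```

The query string is ignored because only the path component of the URL is used.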