Python: adding a timestamp to files downloaded with urllib.urlretrieve


I am using urllib.urlretrieve to download files, and I want to add something that checks for changes before downloading. So far I have the following:

import urllib

urllib.urlretrieve("http://www.site1.com/file.txt", r"output/file1.txt")
urllib.urlretrieve("http://www.site2.com/file.txt", r"output/file2.txt")

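Ideally, I would like the script to check for changes (by comparing last-modified stamps?), skip the download if the file is unchanged, and download it if it is newer. I also need the script to add a timestamp to the file name.

Can anyone help?

I am new to programming (Python is my first language), so any criticism is welcome.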

The simplest way to get a timestamp into the file name is:

import time
'output/file_%d.txt' % time.time()

A human-readable version can be produced like this:

from datetime import datetime
n = datetime.now()
n.strftime('output/file_%Y%m%d_%H%M%S.txt')
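Both snippets hard-code the file name. A minimal sketch of a reusable helper, assuming the stamp should go just before the extension (the name timestamped_name is made up for this illustration and is not part of any library):

```python
import os
from datetime import datetime

def timestamped_name(path, when=None):
    """Return path with a _YYYYmmdd_HHMMSS stamp inserted before the extension."""
    when = when or datetime.now()
    root, ext = os.path.splitext(path)
    return "%s_%s%s" % (root, when.strftime("%Y%m%d_%H%M%S"), ext)

# Passing a fixed datetime makes the result reproducible.
print(timestamped_name("output/file1.txt", datetime(2012, 3, 4, 5, 6, 7)))
# prints output/file1_20120304_050607.txt
```

The result can be passed as the second argument to urlretrieve in place of the fixed "output/file1.txt".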

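urllib.urlretrieve() already does this for you. If the output file name exists, it performs all the checks necessary to avoid downloading the file again.

But this only works if the server supports it. You may therefore want to print the HTTP headers (the second result of the function call) to see whether caching is possible.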
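One way to read those headers is to look for the validators a server needs to send before a conditional request can work at all. The sketch below is only an illustration: cache_validators is a hypothetical helper, and it is demonstrated on a hand-built message rather than a real response. In Python 3, the headers object returned by urllib.request.urlretrieve supports the same get() lookup.

```python
from email.message import Message

def cache_validators(headers):
    """Return the Last-Modified / ETag headers present in a response, if any."""
    found = {}
    for name in ("Last-Modified", "ETag"):
        value = headers.get(name)  # header lookup is case-insensitive
        if value is not None:
            found[name] = value
    return found

# Stand-in for the second value returned by urlretrieve (assumed sample data).
sample = Message()
sample["Last-Modified"] = "Sun, 06 Nov 1994 08:49:37 GMT"
print(cache_validators(sample))
```

An empty result suggests the server gives the client nothing to compare against, so skipping unchanged downloads would not be possible for that URL.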

This article may also help; it ends with the following code:

import urllib
import os

def reporthook(blocks_read, block_size, total_size):
    if not blocks_read:
        print 'Connection opened'
        return
    if total_size < 0:
        # Unknown size
        print 'Read %d blocks' % blocks_read
    else:
        amount_read = blocks_read * block_size
        print 'Read %d blocks, or %d/%d' % (blocks_read, amount_read, total_size)
    return

try:
    filename, msg = urllib.urlretrieve('http://blog.doughellmann.com/', reporthook=reporthook)
    print
    print 'File:', filename
    print 'Headers:'
    print msg
    print 'File exists before cleanup:', os.path.exists(filename)

finally:
    urllib.urlcleanup()

    print 'File still exists:', os.path.exists(filename)


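This downloads a file, shows progress, and prints the headers. Use it to debug your scenario and find out why caching does not work the way you expect.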
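The code above is Python 2. A rough Python 3 equivalent of the progress hook might look like the sketch below. Splitting the formatting into a separate progress_line function is a choice made here for easier testing, and clamping the byte count to total_size is an addition that is not in the original (the last block is usually only partly filled).

```python
def progress_line(blocks_read, block_size, total_size):
    """Build the progress message for one reporthook call."""
    if not blocks_read:
        return "Connection opened"
    if total_size < 0:
        # The server did not report a size.
        return "Read %d blocks" % blocks_read
    # Clamp so the count never exceeds the reported total.
    amount_read = min(blocks_read * block_size, total_size)
    return "Read %d blocks, or %d/%d" % (blocks_read, amount_read, total_size)

def reporthook(blocks_read, block_size, total_size):
    """Hook with the three-argument signature that urlretrieve calls."""
    print(progress_line(blocks_read, block_size, total_size))
```

In Python 3 it would be passed as urllib.request.urlretrieve(url, reporthook=reporthook).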

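Unfortunately, this seems to be rather hard to do in Python, because you have to do everything yourself. The interface of urlretrieve is not very nice either.

The following code should perform the necessary steps (it adds an "If-Modified-Since" header if the file exists and adjusts the timestamp of the downloaded file):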

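import calendar
import os
import time
import urllib.error
import urllib.request

def download_file(url, local_filename, status_callback=None):
    # Download url to local_filename unless the server copy is unchanged.
    opener = urllib.request.build_opener()
    if os.path.isfile(local_filename):
        # Ask the server to reply 304 if nothing changed since our copy's mtime.
        timestamp = os.path.getmtime(local_filename)
        timestr = time.strftime('%a, %d %b %Y %H:%M:%S GMT', time.gmtime(timestamp))
        opener.addheaders.append(("If-Modified-Since", timestr))
    urllib.request.install_opener(opener)
    try:
        local_filename, headers = urllib.request.urlretrieve(url, local_filename, reporthook=status_callback)
        if 'Last-Modified' in headers:
            # Give the local file the server's modification time.
            mtime = calendar.timegm(time.strptime(headers['Last-Modified'], '%a, %d %b %Y %H:%M:%S GMT'))
            os.utime(local_filename, (mtime, mtime))
    except urllib.error.HTTPError as e:
        # 304 Not Modified means the local copy is current; re-raise anything else.
        if e.code != 304:
            raise
    finally:
        urllib.request.install_opener(urllib.request.build_opener())  # Reset opener
    return local_filename

Comments:

-1. The question is how to determine whether the resource on the server has changed.

The question isn't entirely clear, but it does mention timestamps in the file name.

That is epoch time. Any idea how to turn it into a standard (human-readable) time/date?

Hi Aaron, my urllib.urlretrieve implementation keeps overwriting the file even when the file name is the same. What do I need to do to invoke this feature?

When you say "overwriting", do you see it downloading blocks?

Do you have evidence that urlretrieve does this? The retrieve function in my /usr/lib/python2.7/urllib.py clearly does not. It never looks at the Last-Modified header, and it never stats the file to get a time it could send in an If-Modified-Since header so that a 304 response could be used. The only header it uses is Content-Length, to confirm that the download matches the expected size. It just blindly opens the URL and writes it to the file, whether or not the file already exists. I confirmed this by looking at the web server logs as well as the code (in my case I control both sides).

My evidence is the documentation: "If the URL points to a local file [...], the object will not be copied." Maybe the documentation is wrong?

Thanks, this is exactly what I was looking for. It is worth noting that you need to import urllib.request, time and calendar (all of them in the standard library), and that you may need to implement the status_callback function or remove that argument for this code to work.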
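The download_file function builds and parses HTTP dates with time.strftime and time.strptime, whose %a and %b fields follow the current locale. A locale-independent alternative, sketched here under the assumption that Python 3 is in use, is to let email.utils do the conversion. The helper names to_http_date and from_http_date are made up for this example.

```python
from email.utils import formatdate, parsedate_to_datetime

def to_http_date(epoch_seconds):
    """Format seconds since the epoch as an HTTP date such as an If-Modified-Since value."""
    return formatdate(epoch_seconds, usegmt=True)

def from_http_date(value):
    """Parse an HTTP date such as a Last-Modified value into seconds since the epoch."""
    return parsedate_to_datetime(value).timestamp()

print(to_http_date(784111777))
# prints Sun, 06 Nov 1994 08:49:37 GMT
print(from_http_date("Sun, 06 Nov 1994 08:49:37 GMT"))
# prints 784111777.0
```

to_http_date(os.path.getmtime(path)) yields the value for the If-Modified-Since header, and from_http_date(headers['Last-Modified']) yields the modification time to hand to os.utime.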