Python: Automatically retrieving large files over public HTTP into Google Cloud Storage

Tags: python, google-app-engine, google-cloud-storage, google-cloud-data-transfer

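For weather processing purposes, I would like to automatically retrieve daily weather forecast data into Google Cloud Storage.

The files are available at a public HTTP URL, but they are very large (between 30 and 300 megabytes). File size is the main issue.

After reading earlier Stack Overflow threads, I tried two approaches, both unsuccessful: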

1/ First attempt: urlfetch in Google App Engine

from google.appengine.api import urlfetch

url = "http://dcpc-nwp.meteo.fr/servic..."
result = urlfetch.fetch(url)
[...] # Code to save in a Google Cloud Storage bucket

But I get the following error message on the urlfetch line:

DeadlineExceededError: Deadline exceeded while waiting for HTTP response from URL

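2/ Second attempt: the Cloud Storage Transfer Service

According to the documentation, HTTP data can be retrieved directly into Cloud Storage through the Cloud Storage Transfer Service.

However, it requires the size and MD5 of each file before the download. This option does not work in my case because the website does not provide that information.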

3/ Any ideas?

Do you see any solution for automatically retrieving large files over HTTP into my Cloud Storage bucket?

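Currently, Google's Transfer Service requires the MD5 and the size. We understand that this can be hard to deal with in cases like yours, but unfortunately we don't have a great solution today.

Unless you are able to obtain the size and MD5 by (temporarily) downloading the file yourself, I think that is the best you can do.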
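If the file is downloaded temporarily, the size and MD5 can be computed while streaming, without holding the whole file in memory. The following is a minimal sketch, not an official tool: the function names `size_and_md5` and `url_list` are hypothetical, and it assumes the Transfer Service URL list format (a `TsvHttpData-1.0` header line followed by tab-separated URL, size in bytes, and base64-encoded MD5).

```python
import base64
import hashlib


def size_and_md5(stream, chunk_size=1024 * 1024):
    """Return (size_in_bytes, base64_md5) for a binary file-like object.

    The stream is read in chunks, so a 300 MB file never sits in memory at once.
    """
    digest = hashlib.md5()
    size = 0
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
        size += len(chunk)
    return size, base64.b64encode(digest.digest()).decode("ascii")


def url_list(entries):
    """Build URL list text from (url, size, md5_b64) tuples (assumed TSV layout)."""
    lines = ["TsvHttpData-1.0"]
    lines += ["{}\t{}\t{}".format(url, size, md5) for url, size, md5 in entries]
    return "\n".join(lines) + "\n"
```

Any binary stream works as input, for example the response object returned by `urllib.request.urlopen(url)` or a file opened with `open(path, "rb")`.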

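3/ Workaround using a Compute Engine instance

Since large files cannot be retrieved from external HTTP with App Engine or directly with Cloud Storage, I settled on a workaround that uses an always-running Compute Engine instance.

This instance regularly checks whether new weather files are available, downloads them, and uploads them to a Cloud Storage bucket.

For scalability, maintenance, and cost reasons, I would have preferred to use only serverless services, but fortunately:

It works fine on a fresh f1-micro Compute Engine instance (no extra packages required, and only $4/month when running 24/7)
Network traffic from Compute Engine to Google Cloud Storage is free if the instance and the bucket are in the same region ($0/month)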
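The check, download, and upload cycle described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the script actually running on the instance: `download_to_disk` and `sync_once` are hypothetical names, and `bucket` is assumed to be a `google.cloud.storage` Bucket object, which requires the google-cloud-storage package in addition to the standard library.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to_disk(url, dest_dir):
    """Stream url into dest_dir in 1 MiB chunks so memory use stays flat."""
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(url, timeout=300) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out, length=1024 * 1024)
    return dest


def sync_once(urls, bucket, work_dir):
    """Upload every file in urls that the bucket does not hold yet.

    bucket is assumed to behave like google.cloud.storage.Bucket.
    Returns the names of the objects uploaded during this pass.
    """
    uploaded = []
    for url in urls:
        name = url.rsplit("/", 1)[-1]
        blob = bucket.blob(name)
        if blob.exists():
            continue  # already transferred on an earlier pass
        local = download_to_disk(url, work_dir)
        blob.upload_from_filename(str(local))
        local.unlink()  # free local disk once the upload has finished
        uploaded.append(name)
    return uploaded
```

On the instance, `bucket` could come from `storage.Client().bucket("my-weather-bucket")` (a placeholder bucket name), and `sync_once` could be run from cron or from a loop that sleeps between passes.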

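As described in this link, the MD5 and file size can be retrieved easily and quickly with the curl -I command.
The Storage Transfer Service can then be configured to use that information.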
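The same check can be made from Python with a HEAD request, which is what `curl -I` sends. This is only a sketch: `head` and `parse_size_and_md5` are hypothetical helper names, and it assumes the server reports `Content-Length` and `Content-MD5` headers. Many servers omit `Content-MD5`, in which case the second value comes back as `None`.

```python
import urllib.request


def parse_size_and_md5(headers):
    """Pull (size, md5) out of response headers; either value may be None."""
    length = headers.get("Content-Length")
    size = int(length) if length is not None else None
    return size, headers.get("Content-MD5")


def head(url, timeout=30):
    """Send a HEAD request and return the response headers without the body."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers
```

A call such as `parse_size_and_md5(head(url))` then yields the two values the Transfer Service asks for, when the server provides them.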

Another option is to use a serverless Cloud Function. In Python, it could look something like the following:

import requests

def download_url_file(url):
    """Download url into /tmp; return the file name, or None on failure."""
    output_filename = url.split('/')[-1]
    output_filepath = '/tmp/{}'.format(output_filename)
    try:
        print('[ INFO ] Downloading {}'.format(url))
        # stream=True avoids holding the whole response body in memory at once
        with requests.get(url, stream=True, timeout=300) as req:
            if req.status_code != 200:
                print('[ ERROR ] Status Code: {}'.format(req.status_code))
                return None
            # Download and save to /tmp in 1 MiB chunks
            with open(output_filepath, 'wb') as f:
                for chunk in req.iter_content(chunk_size=1024 * 1024):
                    f.write(chunk)
        print('[ INFO ] Successfully downloaded to output_filepath: {} & output_filename: {}'.format(output_filepath, output_filename))
        return output_filename
    except Exception as e:
        # Return None instead of an unbound name when the download fails
        print('[ ERROR ] {}'.format(e))
        return None
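The function above stops once the file is in /tmp. A possible follow-up step is sketched below, assuming `bucket` is a `google.cloud.storage` Bucket object (google-cloud-storage package); `upload_and_cleanup` is a hypothetical helper name. Deleting the local copy matters in Cloud Functions because /tmp is backed by the function's memory.

```python
import os


def upload_and_cleanup(bucket, local_path, blob_name=None):
    """Upload local_path to bucket, then delete the local copy.

    bucket is assumed to behave like google.cloud.storage.Bucket.
    Returns the name of the object that was written.
    """
    blob_name = blob_name or os.path.basename(local_path)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)
    # Remove the file only after the upload call has returned without raising
    os.remove(local_path)
    return blob_name
```

For example, `upload_and_cleanup(bucket, '/tmp/' + download_url_file(url))` would chain the two steps when the download succeeds.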

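Thanks, this information led me to find a workaround with a Compute Engine instance.

I ran into the same problem as Matthieu. As of March 2020, there is an additional serverless solution: Google Cloud Functions (GCF). GCF has 2 GB of memory.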
