Pyspark: how to zip files with shutil in Databricks (on Azure Blob Storage)


My trained deep learning model consists of several files in a folder, so this is not about zipping a DataFrame.

I want to zip this folder (which lives in Azure Blob Storage), but when I use shutil it does not seem to work:

import shutil
modelPath = "/dbfs/mnt/databricks/Models/predictBaseTerm/noNormalizationCode/2020-01-10-13-43/9_0.8147903598547376"
zipPath= "/mnt/databricks/Deploy/" (no /dbfs here or it will error)
shutil.make_archive(base_dir= modelPath, format='zip', base_name=zipPath)

Does anyone know how to do this and get the file onto Azure Blob Storage (where I want to read it from)?

In the end I figured this out myself.

You cannot write directly to dbfs (Azure Blob Storage) with shutil.

You first need to put the file on the local driver node of Databricks, as shown below (the docs note that you cannot write directly to Blob storage):
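
The snippet for that step is missing here; a minimal sketch, assuming a local driver path such as /tmp/model (my placeholder, chosen so that zipPath + ".zip" matches the copy command further below):

import shutil
modelPath = "/dbfs/mnt/databricks/Models/predictBaseTerm/noNormalizationCode/2020-01-10-13-43/9_0.8147903598547376"
zipPath = "/tmp/model"  # local path on the driver node; placeholder, no /dbfs prefix
# Writes /tmp/model.zip on the driver's local disk from the contents of modelPath
shutil.make_archive(zipPath, 'zip', modelPath)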

Then you can copy the file from the local driver node to Blob storage. Note the "file:" prefix, which grabs the file from local storage:

blobStoragePath = "dbfs:/mnt/databricks/Models"
dbutils.fs.cp("file:" +zipPath + ".zip", blobStoragePath)

I lost a few hours on this, so please upvote this answer if it helped you.

Actually, without using shutil, I can compress files in Databricks dbfs into a zip file written as a blob of the Azure Blob Storage container that is mounted to dbfs.

Below is my sample code using the Python standard libraries os and zipfile.

# Mount a container of Azure Blob Storage to dbfs
storage_account_name='<your storage account name>'
storage_account_access_key='<your storage account key>'
container_name = '<your container name>'

dbutils.fs.mount(
  source = "wasbs://"+container_name+"@"+storage_account_name+".blob.core.windows.net",
  mount_point = "/mnt/<a mount directory name under /mnt, such as `test`>",
  extra_configs = {"fs.azure.account.key."+storage_account_name+".blob.core.windows.net":storage_account_access_key})

# List all files which need to be compressed
import os
modelPath = '/dbfs/mnt/databricks/Models/predictBaseTerm/noNormalizationCode/2020-01-10-13-43/9_0.8147903598547376'
filenames = [os.path.join(root, name) for root, dirs, files in os.walk(top=modelPath, topdown=False) for name in files]
# print(filenames)

# Directly zip files to Azure Blob Storage as a blob
# zipPath is the absolute path of the compressed file on the mount point, such as `/dbfs/mnt/test/demo.zip`
zipPath = '/dbfs/mnt/<a mount directory name under /mnt, such as `test`>/demo.zip'
import zipfile
with zipfile.ZipFile(zipPath, 'w') as myzip:
  for filename in filenames:
#    print(filename)
    myzip.write(filename)
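
One refinement worth noting (my addition, not part of the original answer): myzip.write(filename) stores each entry under its full /dbfs/... path inside the archive. Passing arcname keeps the entries relative to modelPath; a sketch, reusing modelPath, zipPath and filenames from above:

import os
import zipfile
with zipfile.ZipFile(zipPath, 'w') as myzip:
  for filename in filenames:
    # Store paths relative to modelPath instead of the full /dbfs/mnt/... path
    myzip.write(filename, arcname=os.path.relpath(filename, modelPath))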

This looks neater than what I did! Thanks for the thorough explanation.