Pyspark: how to zip files with shutil in Databricks (on Azure Blob Storage)


My trained deep learning model consists of several files in a folder, so this is not about zipping a DataFrame.

I want to zip this folder (which lives in Azure Blob Storage), but when I use shutil it does not seem to work:

import shutil
modelPath = "/dbfs/mnt/databricks/Models/predictBaseTerm/noNormalizationCode/2020-01-10-13-43/9_0.8147903598547376"
zipPath= "/mnt/databricks/Deploy/" (no /dbfs here or it will error)
shutil.make_archive(base_dir= modelPath, format='zip', base_name=zipPath)

Does anyone know how to do this and get the file onto Azure Blob Storage (where I want to read it from)?

In the end I figured this out myself.

You cannot write directly to dbfs (Azure Blob Storage) with shutil.

You first need to put the file on the local driver node of Databricks, as shown below (the docs note that you cannot write directly to Blob storage):
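
The snippet for that step is missing here; a minimal sketch, assuming a local driver path such as /tmp/model (my placeholder, chosen so that zipPath + ".zip" matches the copy command further below):

import shutil
modelPath = "/dbfs/mnt/databricks/Models/predictBaseTerm/noNormalizationCode/2020-01-10-13-43/9_0.8147903598547376"
zipPath = "/tmp/model"  # local path on the driver node; placeholder, no /dbfs prefix
# Writes /tmp/model.zip on the driver's local disk from the contents of modelPath
shutil.make_archive(zipPath, 'zip', modelPath)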

Then you can copy the file from the local driver node to Blob storage. Note the "file:" prefix, which grabs the file from local storage:

blobStoragePath = "dbfs:/mnt/databricks/Models"
dbutils.fs.cp("file:" +zipPath + ".zip", blobStoragePath)

I lost a few hours on this, so please upvote this answer if it helped you.

Actually, without using shutil, I can compress files in Databricks dbfs into a zip file written as a blob of the Azure Blob Storage container that is mounted to dbfs.

Below is my sample code using the Python standard libraries os and zipfile.

# Mount a container of Azure Blob Storage to dbfs
storage_account_name='<your storage account name>'
storage_account_access_key='<your storage account key>'
container_name = '<your container name>'

dbutils.fs.mount(
  source = "wasbs://"+container_name+"@"+storage_account_name+".blob.core.windows.net",
  mount_point = "/mnt/<a mount directory name under /mnt, such as `test`>",
  extra_configs = {"fs.azure.account.key."+storage_account_name+".blob.core.windows.net":storage_account_access_key})

# List all files which need to be compressed
import os
modelPath = '/dbfs/mnt/databricks/Models/predictBaseTerm/noNormalizationCode/2020-01-10-13-43/9_0.8147903598547376'
filenames = [os.path.join(root, name) for root, dirs, files in os.walk(top=modelPath, topdown=False) for name in files]
# print(filenames)

# Directly zip files to Azure Blob Storage as a blob
# zipPath is the absolute path of the compressed file on the mount point, such as `/dbfs/mnt/test/demo.zip`
zipPath = '/dbfs/mnt/<a mount directory name under /mnt, such as `test`>/demo.zip'
import zipfile
with zipfile.ZipFile(zipPath, 'w') as myzip:
  for filename in filenames:
#    print(filename)
    myzip.write(filename)
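
One refinement worth noting (my addition, not part of the original answer): myzip.write(filename) stores each entry under its full /dbfs/... path inside the archive. Passing arcname keeps the entries relative to modelPath; a sketch, reusing modelPath, zipPath and filenames from above:

import os
import zipfile
with zipfile.ZipFile(zipPath, 'w') as myzip:
  for filename in filenames:
    # Store paths relative to modelPath instead of the full /dbfs/mnt/... path
    myzip.write(filename, arcname=os.path.relpath(filename, modelPath))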

This looks neater than what I did! Thanks for the thorough explanation.