Python 3.x: Reading an Excel file from Azure Databricks


I am trying to read an Excel file (.xlsx) from Azure Databricks. The file is stored in ADLS Gen2.

For example:

srcPathforParquet = "wasbs://hyxxxx@xxxxdatalakedev.blob.core.windows.net//1_Raw//abc.parquet"
srcPathforExcel = "wasbs://hyxxxx@xxxxdatalakedev.blob.core.windows.net//1_Raw//src.xlsx"

Reading the Parquet file from the path works fine:

srcparquetDF = spark.read.parquet(srcPathforParquet )

Reading the Excel file from the path throws an error: No such file or directory

srcexcelDF = pd.read_excel(srcPathforExcel , keep_default_na=False, na_values=[''])

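According to my repro, an Excel file in ADLS Gen2 cannot be read directly by using the storage account access key. When I tried to read the Excel file through the ADLS Gen2 URL, I got the same error message:

FileNotFoundError: [Errno 2] No such file or directory: 'abfss://filesystem@chepragen2.dfs.core.windows.net/flightdata/drivers.xlsx'

Steps to read an Excel file (.xlsx) from Azure Databricks when the file is in ADLS Gen2:

Step 1: Mount the ADLS Gen2 storage account.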

configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<application-id>",
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)
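
Once the mount exists, the workbook can be opened with pandas through the local /dbfs path under which Databricks exposes DBFS to local file APIs. The following is a minimal sketch, not the original answer's code: the helper name `to_local_dbfs_path` is hypothetical, the mount name and file path are placeholders, and it assumes a cluster where the /dbfs FUSE mount is available.

```python
def to_local_dbfs_path(path: str) -> str:
    """Map a DBFS path such as 'dbfs:/mnt/x/a.xlsx' or '/mnt/x/a.xlsx'
    to the local FUSE path '/dbfs/mnt/x/a.xlsx' that pandas can open."""
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    if path.startswith("/dbfs/"):
        return path  # already a local FUSE path
    return "/dbfs/" + path.lstrip("/")


if __name__ == "__main__":
    # Placeholder mount name and file; replace with your own values.
    local_path = to_local_dbfs_path("/mnt/<mount-name>/1_Raw/src.xlsx")
    print(local_path)
    # On a Databricks cluster with the mount in place, this would then be:
    #   import pandas as pd
    #   df = pd.read_excel(local_path, keep_default_na=False, na_values=[''])
```

Spark readers keep using the mount path itself (for example /mnt/<mount-name>/...); only local-file libraries such as pandas need the /dbfs prefix.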

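The pandas.read_excel method does not support accessing files through wasbs or abfss scheme URLs. So if you want to use pandas to access the file, I suggest you create a SAS token and use the https scheme with the SAS token to access the file, or download the file as a stream and then read it with pandas. Alternatively, you can mount the storage account as a file system and then access the file the way @CHEEKATLAPRADEEP-MSFT described.

For example: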

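Access with a SAS token

Create a SAS token through the Azure portal.

Code: see the pd.read_excel call with the https URL and SAS token later on this page.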
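
As an illustration of the URL shape used in this approach, here is a minimal sketch of assembling the https URL from its parts. The helper name `build_sas_url` and the sample account, file system, path, and token values are all hypothetical; only the overall `https://<account name>.dfs.core.windows.net/<file system>/<path>?<sas token>` layout comes from the answer.

```python
from urllib.parse import quote


def build_sas_url(account_name: str, file_system: str, file_path: str, sas_token: str) -> str:
    """Assemble an https URL for a file in ADLS Gen2 with the SAS token as the query string."""
    # Percent-encode the path but keep '/' as the segment separator.
    encoded_path = quote(file_path.strip("/"), safe="/")
    # A SAS token copied from the portal may start with '?'; strip it to avoid '??'.
    token = sas_token.lstrip("?")
    return f"https://{account_name}.dfs.core.windows.net/{file_system}/{encoded_path}?{token}"


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    url = build_sas_url("myaccount", "test", "data/my sample.xlsx", "?sv=2020-08-04&sig=abc")
    print(url)
    # The URL can then be passed to pandas: pd.read_excel(url)
```

Encoding the path separately from the token avoids double-encoding the signature, which is already URL-encoded when the portal generates it.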


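In addition, we can use PySpark to read the Excel file, but we need to add the com.crealytics:spark-excel jar to our environment.

For example, add the package com.crealytics:spark-excel_2.12:0.13.1 via Maven. Note that if you use Scala 2.11, add the package com.crealytics:spark-excel_2.11:0.13.1 instead.

Code: see the com.crealytics.spark.excel reader snippet later on this page.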

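In my experience, these are the basic steps to read an Excel file from ADLS Gen2 in Databricks:

I installed the following library on my Databricks cluster:

com.crealytics:spark-excel_2.12:0.13.6

I added the following Spark config:

spark.conf.set(adlsAccountKeyName, adlsAccountKeyValue)
adlsAccountKeyName --> fs.azure.account.key.<YOUR_ADLS_ACCOUNT_NAME>.blob.core.windows.net
adlsAccountKeyValue --> the access key of your ADLS storage account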
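
The property name has to match the endpoint in the path being read: wasbs:// paths use blob.core.windows.net, while abfss:// paths use dfs.core.windows.net. The sketch below derives the property name from a storage URL; the helper name `account_key_property` is hypothetical and not part of any Azure or Databricks API.

```python
from urllib.parse import urlparse


def account_key_property(storage_url: str) -> str:
    """Return the 'fs.azure.account.key.<host>' property name for a wasb(s):// or abfs(s):// URL."""
    parsed = urlparse(storage_url)
    if parsed.scheme not in ("wasb", "wasbs", "abfs", "abfss"):
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    # The authority has the form <container>@<account>.<endpoint>; keep the host part.
    host = parsed.netloc.split("@")[-1]
    return f"fs.azure.account.key.{host}"


if __name__ == "__main__":
    path = "abfss://test@testadls05.dfs.core.windows.net/data/sample.xlsx"
    print(account_key_property(path))
    # In a Databricks notebook this could be combined with the config call:
    #   spark.conf.set(account_key_property(path), "<account key>")
```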

I used the following code to get a Spark DataFrame from the Excel file in ADLS (the myDataFrame snippet at the end of this page):


Are you using pandas to read the Excel file?
@JimXu: Yes. According to the documentation, pandas.read_excel does not support the wasbs scheme.
import pandas as pd
# Read the workbook over https; the SAS token is passed as the query string.
pdf = pd.read_excel('https://<account name>.dfs.core.windows.net/<file system>/<path>?<sas token>')
print(pdf)

import io

import pandas as pd
from azure.storage.filedatalake import DataLakeServiceClient

# Authenticate against the ADLS Gen2 (dfs) endpoint with the storage account key.
service_client = DataLakeServiceClient(account_url='https://<account name>.dfs.core.windows.net/', credential='<account key>')

# Point at the workbook inside the 'test' file system.
file_client = service_client.get_file_client(file_system='test', file_path='data/sample.xlsx')
with io.BytesIO() as f:
    downloader = file_client.download_file()
    bytes_read = downloader.readinto(f)  # number of bytes written into the buffer
    print(bytes_read)
    f.seek(0)  # rewind so pandas reads from the start of the buffer
    df = pd.read_excel(f)
    print(df)

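# Register the storage account key so Spark can resolve abfss:// paths.
spark._jsc.hadoopConfiguration().set("fs.azure.account.key.<account name>.dfs.core.windows.net", '<account key>')

print("use spark")
# Requires the com.crealytics:spark-excel library to be installed on the cluster.
df = (spark.read.format("com.crealytics.spark.excel")
      .option("header", "true")
      .load('abfss://test@testadls05.dfs.core.windows.net/data/sample.xlsx'))

df.show()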

myDataFrame = (spark.read.format("com.crealytics.spark.excel")
               .option("dataAddress", "'Sheetname'!")
               .option("header", "true")
               .option("treatEmptyValuesAsNulls", "true")
               .option("inferSchema", "false")
               .option("addColorColumns", "false")
               .option("startColumn", 0)
               .option("endColumn", 99)
               .option("timestampFormat", "dd-MM-yyyy HH:mm:ss")
               .load(FullFilePathExcel)
               )