Folder statistics in Azure Data Lake


I'm trying to summarize how much data has been written to a folder in my Data Lake. What is the best way to do this? Should I use a U-SQL job? HDInsight?

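There are two ways to do this:

If it is a one-time operation, you can use Azure Storage Explorer, navigate to the Data Lake Store folder, and get its size.

If you want a programmatic way to do this, Data Lake Store provides a WebHDFS-compatible API that can report several folder properties: GETCONTENTSUMMARY. You can find more details here:

Hope this helps

José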
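The GETCONTENTSUMMARY call mentioned in the answer above could be sketched roughly as follows. This is a minimal sketch using only the Python standard library, not an official client. It assumes a Data Lake Store (Gen1) account reachable at `<account>.azuredatalakestore.net` and an Azure AD bearer token that you obtain separately. The function names (`content_summary_url`, `parse_content_summary`, `get_folder_summary`) are hypothetical helpers introduced for illustration.

```python
import json
import urllib.parse
import urllib.request


def content_summary_url(account: str, folder: str) -> str:
    """Build the GETCONTENTSUMMARY URL for a Data Lake Store (Gen1) folder."""
    # quote() keeps "/" intact by default, so nested folders stay readable.
    quoted = urllib.parse.quote(folder.strip("/"))
    return (f"https://{account}.azuredatalakestore.net/webhdfs/v1/"
            f"{quoted}?op=GETCONTENTSUMMARY")


def parse_content_summary(body: str) -> dict:
    """Pick the size-related fields out of a WebHDFS ContentSummary response."""
    summary = json.loads(body)["ContentSummary"]
    return {
        "bytes": summary["length"],
        "files": summary["fileCount"],
        "directories": summary["directoryCount"],
    }


def get_folder_summary(account: str, folder: str, token: str) -> dict:
    """Call the API; `token` is assumed to be a valid Azure AD bearer token."""
    request = urllib.request.Request(
        content_summary_url(account, folder),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_content_summary(response.read().decode("utf-8"))
```

With real credentials, `get_folder_summary("mylake", "/data/raw", token)["bytes"]` would give the folder size in bytes; `mylake` and `/data/raw` are placeholder names.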

You can use Python code to loop over the files. See:
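One way such a loop might look is sketched below. To keep it independent of any particular SDK, the listing function is passed in as a parameter. The sketch assumes each listing entry is a dict with `name`, `type` (`"FILE"` or `"DIRECTORY"`) and `length` keys, which matches what `AzureDLFileSystem.ls(path, detail=True)` from `azure-datalake-store` is expected to return. `folder_size` is a hypothetical helper name.

```python
def folder_size(ls, path):
    """Recursively sum the sizes of all files under `path`.

    `ls` is any callable that returns the entries directly inside a folder
    as dicts with 'name', 'type' and 'length' keys (an assumption here).
    """
    total = 0
    for entry in ls(path):
        if entry["type"] == "DIRECTORY":
            # Descend into subfolders and add their totals.
            total += folder_size(ls, entry["name"])
        else:
            total += entry["length"]
    return total
```

Against a real account this might be called as `folder_size(lambda p: adls_client.ls(p, detail=True), "my/folder")`, where `adls_client` is an `AzureDLFileSystem` instance and `my/folder` is a placeholder path.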

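If you want a quick cross-check:

Download Azure Storage Explorer for Windows.

Open the folder whose size details you want to see.

In the top bar menu, select More -> Folder Statistics. This gives you the details of the directory, including its size in bytes. See the attached [sample snapshot of the Azure Storage Explorer menu][1].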

[1] :

Below is a script that helps get folder/file statistics. Also, please verify that all variables are set to values that match your environment.

import csv, datetime, configparser
from azure.datalake.store import core,lib

# Returns the size of each subdirectory 
def getUsage(adls_client,data,level):
    
    temp=[]
    # Split the path by '/' and store in list
    for i in data:
        temp.append(i.split('/'))

    # Prepare pathList by dropping the file name and keeping at most `level` directories
    pathList=[]
    for i in temp:
        # The last component is the file name, so at most len(i)-1 components are directories
        maxDepth = min(level, len(i)-1)
        pathList.append(i[:maxDepth])
    
    # Remove duplicates
    uniquePaths = set(tuple(x) for x in pathList)
    pathsPreparedDU= list("/".join(x) for x in uniquePaths)
    
    # Get usage for the directories from prepared paths
    answers=[]
    temp=[]
    temp2=""
    blankLevelCnt =0 

    for i in pathsPreparedDU:
        temp=[]
        temp2=""
        usage=adls_client.du(i, deep=True, total=True)
        temp.append(i.split('/'))
        for item in temp:
            if len(item) < level+1:
                blankLevelCnt = (level+1) - len(item)
        temp2=temp2+i
        for j in range(blankLevelCnt):
            temp2=temp2+"/"
        temp2=temp2+str(usage)
        answers.append([temp2])

    # add element for CSV header
    csvList = []
    temp=[]
    temp.append("Date/Time")    
    for i in range(level):
        temp.append("Level "+str(i+1))

    temp.append("Size (Bytes)")    
    temp.append("Size (KB)")    
    temp.append("Size (MB)")    
    temp.append("Size (GB)")    
    temp.append("Size (TB)")    

    csvList.append(temp)
    now = datetime.datetime.now()

    for i in answers:
        usageBytes = int(i[0].split('/')[-1])
        # Derive every unit from the byte count so rounding errors do not compound
        usageKBytes = round(usageBytes/1024,2)
        usageMBytes = round(usageBytes/1024**2,2)
        usageGBytes = round(usageBytes/1024**3,2)
        usageTBytes = round(usageBytes/1024**4,2)
        csvList.append((str(now)[:16]+"/"+i[0]+"/"+str(usageKBytes)+"/"+str(usageMBytes)+"/"+str(usageGBytes)+"/"+str(usageTBytes)).split('/'))

    return csvList

# Returns the adls_client object
def connectADLS(tenant_id,app_id,app_key, adls_name):
    adls_credentials = lib.auth(tenant_id=tenant_id,client_secret=app_key,client_id=app_id)
    return core.AzureDLFileSystem(adls_credentials, store_name=adls_name)

# Returns the paths of all files found recursively under the root directory
def getSubdirectories(adls_client,root_dir):
    return adls_client.walk(root_dir)

# Write to CSV
def writeCSV(root_dir,csvList):
    
    fileprefixes = root_dir.split('/')
    prefix = "root-"
    while('' in fileprefixes) : 
        fileprefixes.remove('') 

    if len(fileprefixes) > 0:
        prefix = ""
        for i in fileprefixes:
            prefix = prefix + i + "-" 
    
    x = datetime.datetime.today().strftime('%Y-%m-%d')

    filename = prefix+""+ x +".csv"

    # newline='' lets the csv module control line endings; the with block closes the file
    with open(filename, "w", newline='') as csvFile:
        writer = csv.writer(csvFile,lineterminator='\n')
        writer.writerows(csvList)

    print("file Generated")
    print('##vso[task.setvariable variable=filename;]%s' % (filename))

if __name__ == "__main__":

    # 1. Parse config file and get service principal details
    config = configparser.ConfigParser()
    config.read('config.ini')
    
    tenant_id=config['SERVICE_PRINCIPAL']['tenant_id']
    app_id=config['SERVICE_PRINCIPAL']['app_id']
    app_key=config['SERVICE_PRINCIPAL']['app_key']
    adls_name = config['ADLS_ACCT']['adls_name'] 
    root_dir = config['ADLS_ACCT']['root_dir'] 
    level = config['ADLS_ACCT']['level'] 

    # 2. Connect to ADLS 
    adls_client = connectADLS(tenant_id,app_id,app_key, adls_name)

    # 3. recursively lists all files
    data = getSubdirectories(adls_client,root_dir)

    # 4. Get usage for the directories
    csvList = getUsage(adls_client,data,int(level))

    # 5. Write the report to a local CSV file
    writeCSV(root_dir,csvList)
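The script reads its settings from a `config.ini` file that the answer does not show. A minimal sketch for generating a template with the section and key names the script looks up is given below. All values are placeholders to be replaced with your own, and `write_config_template` is a hypothetical helper name.

```python
import configparser


def write_config_template(path="config.ini"):
    """Write a config.ini skeleton with the keys the script above reads."""
    config = configparser.ConfigParser()
    # Service principal used to authenticate against the Data Lake Store.
    config["SERVICE_PRINCIPAL"] = {
        "tenant_id": "<your-tenant-id>",
        "app_id": "<your-app-id>",
        "app_key": "<your-app-key>",
    }
    # Account name, starting folder, and how many directory levels to report.
    config["ADLS_ACCT"] = {
        "adls_name": "<your-adls-account-name>",
        "root_dir": "/",
        "level": "2",
    }
    with open(path, "w") as handle:
        config.write(handle)
    return path
```

Calling `write_config_template()` once and then editing the placeholders by hand keeps secrets out of the script itself; `level` controls how many directory levels appear as columns in the CSV report.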


Hi Jose, do you know how to achieve this with ADLS Gen2?