Python: Read S3 files whose last-modified time falls within a window into a DataFrame

python python-3.x pandas apache-spark

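I have an S3 bucket whose objects have last-modified timestamps ranging from very old to current. I need to find the files whose last-modified stamp falls within a given window, and then read those files (JSON) into some kind of DataFrame (pandas, Spark, etc.).

I tried collecting the files, reading them one at a time, and appending them with the code below, but it is very slow: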

import boto3
import pandas as pd

session = boto3.session.Session(region_name=region)

#Gather all keys that have a modified stamp between max_previous_data_extracted_timestamp and start_time_proper
s3 = session.resource('s3', region_name=region)
bucket = s3.Bucket(args.sourceBucket)
app_body = []
for obj in bucket.objects.all():
    obj_datetime = obj.last_modified.replace(tzinfo=None)
    if args.accountId + '/Patient' in obj.key and obj_datetime > max_previous_data_extracted_timestamp_datetime and obj_datetime <= start_time_datetime:
        obj_df = pd.read_csv(obj.get()['Body'])
        app_body.append(obj_df)

merged_dataframe = pd.concat(app_body)

Spark is one way to do this.
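When working with a large number of files in an S3 bucket, always keep in mind that listing every object in the bucket is expensive: each call returns at most 1000 objects, plus a pointer for fetching the next batch. That makes the listing very hard to parallelize unless you know the key structure and use it to narrow those calls.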
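To make the paging concrete, here is a minimal sketch (separate from the Spark code in this answer) of filtering listing pages by `LastModified`. It assumes each page is a dict shaped like a boto3 `list_objects_v2` response; the helper names `keys_in_window` and `list_pages` are hypothetical.

```python
def keys_in_window(pages, start, end):
    """Return keys whose LastModified satisfies start < LastModified <= end.

    `pages` is any iterable of dicts shaped like list_objects_v2 responses.
    """
    keys = []
    for page in pages:
        # 'Contents' is absent when a page (or the whole prefix) has no objects
        for obj in page.get("Contents", []):
            if start < obj["LastModified"] <= end:
                keys.append(obj["Key"])
    return keys


def list_pages(bucket, prefix):
    """Yield listing pages (up to 1000 keys each) under a single prefix."""
    import boto3  # imported lazily so keys_in_window works without AWS access

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    # The paginator follows the continuation token from page to page
    yield from paginator.paginate(Bucket=bucket, Prefix=prefix)
```

Something like `keys_in_window(list_pages(bucket, account_id + "/Patient/"), start, end)` would then list only one account's prefix instead of the whole bucket. boto3 returns `LastModified` as a timezone-aware datetime, so `start` and `end` need to be timezone-aware as well (or strip `tzinfo` first, as the question's code does).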
Apologies if the Spark code below does not run as written. I work in Scala, but it should be close to working.
Given that your structure is bucket/account_identifier/Patient/Patient_identifier:
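# account_identifiers -- provided from DB
# fetch_files_for_account (shown below) must be defined before these lines run
accounts_rdd = sc.parallelize(account_identifiers, number_of_partitions)
# Each partition lists its own accounts, so the S3 listing calls run in parallel
paths = accounts_rdd.mapPartitions(fetch_files_for_account).collect()
df = spark.read.json(paths)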


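import boto3


def fetch_files_for_account(accounts):
    # Runs on the executors: list each account's prefix and keep the keys
    # whose LastModified falls inside the extraction window.
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    result = []
    for a in accounts:
        # The paginator follows the continuation token, so prefixes with more
        # than 1000 keys are listed completely.
        for page in paginator.paginate(Bucket=args.sourceBucket, Prefix=a):
            # 'Contents' is missing when nothing matches the prefix
            for i in page.get('Contents', []):
                obj_datetime = i['LastModified'].replace(tzinfo=None)
                if max_previous_data_extracted_timestamp_datetime < obj_datetime <= start_time_datetime:
                    # Use 's3a://' instead if your cluster reads S3 through the s3a filesystem
                    result.append('s3://' + args.sourceBucket + '/' + i['Key'])
    return iter(result)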
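If you would rather stay in pandas, the per-file downloads can be overlapped with a thread pool, since most of the time goes to waiting on S3. This is a minimal sketch that swaps Spark for `concurrent.futures`, so it is not the approach above. It assumes each file holds a single JSON object; `fetch_json_records` and the `fetch` callable are hypothetical names.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def fetch_json_records(keys, fetch, max_workers=16):
    """Download and parse many single-object JSON files concurrently.

    `fetch` is any callable that maps a key to the file's raw bytes or str.
    Results come back in the same order as `keys`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises the first worker error
        return [json.loads(body) for body in pool.map(fetch, keys)]
```

With boto3, `fetch` could be something like `lambda key: s3.get_object(Bucket=bucket, Key=key)['Body'].read()` for a shared client `s3`, and `pd.DataFrame(records)` or `pd.json_normalize(records)` would then build the frame from the returned list.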

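What is the structure inside the S3 bucket? Is there a pattern to the file names? How many files do you have?

The structure is bucket/account_identifier/Patient/Patient_identifier, where the identifiers are UUID-style strings and Patient_identifier is the file name, a single JSON file. The static directory I am pulling from has about 8K files for one given account identifier. For example: s3://bucket_name/123g9999-c424-4662-86c8-f99cae5bb51e/Patient/3748295d-3b78-4927-b4fc-4b33ad7gev8a

Is the JSON newline-delimited (one JSON object per line)?

No, each file is a single JSON object, on one line.

This is great! Thank you very much, 杜桑! I did have to spend some time getting the s3a filesystem up and running so that spark.read.json(paths) could take the paths, but that wasn't too bad.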