AWS Glue ETL doesn't output all records

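I have an ETL script that is meant to flatten a set of 4 million JSON files using Relationalize. The script runs fine on a test set of 300 files, but when run against an S3 bucket containing the 4 million files it produces only 1,500 output files, each containing the data for a single record.

I have tried several different configurations of this script, but they all produce the same result: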

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Begin variables to customize with your information
glue_source_database = "mydatabase"
glue_source_table = "mytable"
glue_temp_storage = "s3://my-data/glue_temp"
glue_relationalize_output_s3_path = "s3://my-data/glue_output/mytable_flat/"
dfc_root_table_name = "root" #default value is "roottable"
# End variables to customize with your information


datasource0 = glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = glue_source_table, transformation_ctx = "datasource0")
dfc = Relationalize.apply(frame = datasource0, staging_path = glue_temp_storage, name = dfc_root_table_name, transformation_ctx = "dfc")
origdata = dfc.select(dfc_root_table_name)

origdataoutput = glueContext.write_dynamic_frame.from_options(frame = origdata, connection_type = "s3", connection_options = {"path": glue_relationalize_output_s3_path}, format = "json", transformation_ctx = "origdataoutput")

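It looks like you are only passing the root table to glueContext.create_dynamic_frame_from_catalog. When you run Relationalize, it returns a DynamicFrameCollection. To see the list of DynamicFrames in that collection, print them with dfc.keys().

See step 6 in the linked guide to understand how Relationalize works.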
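
To make the suggestion above concrete, here is a minimal sketch of writing every frame in the collection rather than only the root table. It relies only on the keys() and select() methods that DynamicFrameCollection exposes. The helper name write_all_frames, the write_fn callback, and the one-prefix-per-table layout are assumptions made for illustration and are not part of the original script.

```python
def write_all_frames(dfc, base_path, write_fn):
    """Write every frame in a DynamicFrameCollection-like object.

    dfc       : any object exposing keys() and select(name)
    base_path : output prefix, e.g. "s3://my-data/glue_output/"
    write_fn  : callable(frame, path) that performs the actual write
                (hypothetical hook; in a Glue job it would wrap
                glueContext.write_dynamic_frame.from_options)

    Returns a dict mapping each frame name to the path it was written to.
    """
    written = {}
    for name in sorted(dfc.keys()):
        # One output prefix per table (an assumed layout, adjust as needed).
        path = base_path.rstrip("/") + "/" + name + "/"
        write_fn(dfc.select(name), path)
        written[name] = path
    return written


# Inside the Glue job this might be wired up as follows (untested sketch):
#
# write_all_frames(
#     dfc,
#     "s3://my-data/glue_output/",
#     lambda frame, path: glueContext.write_dynamic_frame.from_options(
#         frame=frame,
#         connection_type="s3",
#         connection_options={"path": path},
#         format="json",
#     ),
# )
```

Passing the writer in as a callback keeps the loop testable outside a Glue environment. The child tables produced by Relationalize hold the flattened array contents, so writing only the root table leaves that data out of the output.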
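After digging into this further, Relationalize does appear to drop records, for reasons that are still unclear. If I count the records in the source DynamicFrame, then run Relationalize and count the root table, the root-table count is lower than the total number of objects in the source DynamicFrame.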
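
The comparison described above can be sketched as a small helper that counts the source frame and every frame in the collection, then reports how many records are missing from the root table. It assumes only that each frame exposes count(), as DynamicFrame does, and that the collection exposes keys() and select(). The name count_report and the shape of the returned dict are hypothetical.

```python
def count_report(source, dfc, root_name):
    """Compare the source record count with the counts after Relationalize.

    source    : frame-like object exposing count()
    dfc       : collection-like object exposing keys() and select(name)
    root_name : name of the root table, e.g. "root"
    """
    source_count = source.count()
    # Count every table in the collection, not just the root.
    table_counts = {name: dfc.select(name).count() for name in dfc.keys()}
    return {
        "source": source_count,
        "tables": table_counts,
        # A positive value means the root table has fewer rows than the source.
        "missing_from_root": source_count - table_counts[root_name],
    }


# In the Glue job this might be called as (untested sketch):
# print(count_report(datasource0, dfc, dfc_root_table_name))
```

Each count() triggers a full pass over the data, so on 4 million files this kind of check is probably best run on a sample first.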