AWS Glue ETL doesn't output all records

I have an ETL script that is intended to flatten a set of 4 million JSON files using Relationalize. The script works fine on a test set of 300 files, but when run against the S3 bucket containing all 4 million files, it produces only 1,500 output files, each containing the data of a single record.

I have tried several different configurations of this script, but they all produce the same result:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Begin variables to customize with your information
glue_source_database = "mydatabase"
glue_source_table = "mytable"
glue_temp_storage = "s3://my-data/glue_temp"
glue_relationalize_output_s3_path = "s3://my-data/glue_output/mytable_flat/"
dfc_root_table_name = "root" #default value is "roottable"
# End variables to customize with your information


datasource0 = glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = glue_source_table, transformation_ctx = "datasource0")
dfc = Relationalize.apply(frame = datasource0, staging_path = glue_temp_storage, name = dfc_root_table_name, transformation_ctx = "dfc")
origdata = dfc.select(dfc_root_table_name)

origdataoutput = glueContext.write_dynamic_frame.from_options(frame = origdata, connection_type = "s3", connection_options = {"path": glue_relationalize_output_s3_path}, format = "json", transformation_ctx = "origdataoutput")

It looks like you are only selecting the root table from the result of `create_dynamic_frame.from_catalog`. When you apply Relationalize, it returns a DynamicFrameCollection. To see the list of dynamic frames in this collection, try printing them with `dfc.keys()`.


See step 6 of the referenced guide for how Relationalize works.
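A minimal sketch of what the answer suggests, reusing the variable names from the script above (the frame names that Relationalize produces depend on your JSON structure, so the per-frame output paths here are an assumption):

```python
# Inspect which frames Relationalize produced, then write each one out,
# not just the root table.
print(dfc.keys())

for frame_name in dfc.keys():
    frame = dfc.select(frame_name)
    glueContext.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        # Hypothetical layout: one subfolder per relationalized frame.
        connection_options={"path": glue_relationalize_output_s3_path + frame_name + "/"},
        format="json",
        transformation_ctx="write_" + frame_name,
    )
```

Nested arrays and structs in the source JSON each become their own frame in the collection, so writing only `root` discards everything that was split out into child tables.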

After digging into this further, it does appear that Relationalize is dropping records, though the reason is unclear. If I count the records in the source dynamic frame, then run Relationalize and count the root table, the resulting number is lower than the total number of objects in the source.
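The count comparison described above can be sketched like this, again reusing the variables from the script (a diagnostic snippet, not part of the original job):

```python
# Compare record counts before and after Relationalize.
source_count = datasource0.count()

dfc = Relationalize.apply(
    frame=datasource0,
    staging_path=glue_temp_storage,
    name=dfc_root_table_name,
    transformation_ctx="dfc_count_check",
)
root_count = dfc.select(dfc_root_table_name).count()

print("source records:", source_count)
print("root records:  ", root_count)
# root_count < source_count would confirm that records are being dropped
# rather than merely redistributed into child frames.
```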