Amazon Web Services: looping over a large DynamicFrame to write it to S3 in chunks, to get around the 'maxResultSize' error

Tags: amazon-web-services, pyspark, etl, aws-glue

I have a large DynamicFrame in an AWS Glue ETL job. When I try to output this data to S3, the job fails because the task is too large.

Error:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 3225 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

I believe a good solution would be to split the DynamicFrame by date, loop over each date's data, and write it out in smaller chunks. Perhaps something like this:

for eventDateParam in mapped_datasource0_general.eventDate:
    partitioned_dataframe_general = mapped_datasource0_general.where(eventDate = eventDateParam)
    dataoutput_general = glueContext.write_dynamic_frame.from_options(frame = partitioned_dataframe_general, connection_type = "s3", connection_options = {"path": glue_relationalize_output_s3_path_general, "partitionKeys": ["eventDate"]}, format = "parquet", transformation_ctx = "dataoutput_general")
I'm fairly new to AWS Glue and have run into all sorts of errors while trying to find a workaround here. Any suggestions would be greatly appreciated.

Cheers

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Edit:

Longer stack trace:

Traceback (most recent call last):
File "script_2018-06-19-22-36-11.py", line 63, in <module>
glueContext.write_dynamic_frame.from_options(frame = partitioned_mapped_personal_DF, connection_type = "s3", connection_options = {"path": glue_relationalize_output_s3_path_personal, "partitionKeys": ["eventDate"]}, format = "parquet", transformation_ctx = "dataoutput_personal")
File "/mnt/yarn/usercache/root/appcache/application_1529446917701_0002/container_1529446917701_0002_01_000001/PyGlue.zip/awsglue/dynamicframe.py", line 572, in from_options
File "/mnt/yarn/usercache/root/appcache/application_1529446917701_0002/container_1529446917701_0002_01_000001/PyGlue.zip/awsglue/context.py", line 191, in write_dynamic_frame_from_options
File "/mnt/yarn/usercache/root/appcache/application_1529446917701_0002/container_1529446917701_0002_01_000001/PyGlue.zip/awsglue/context.py", line 214, in write_from_options
File "/mnt/yarn/usercache/root/appcache/application_1529446917701_0002/container_1529446917701_0002_01_000001/PyGlue.zip/awsglue/data_sink.py", line 32, in write
File "/mnt/yarn/usercache/root/appcache/application_1529446917701_0002/container_1529446917701_0002_01_000001/PyGlue.zip/awsglue/data_sink.py", line 28, in writeFrame
File "/mnt/yarn/usercache/root/appcache/application_1529446917701_0002/container_1529446917701_0002_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/mnt/yarn/usercache/root/appcache/application_1529446917701_0002/container_1529446917701_0002_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/mnt/yarn/usercache/root/appcache/application_1529446917701_0002/container_1529446917701_0002_01_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1198.pyWriteDynamicFrame.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:213)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:435)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at com.amazonaws.services.glue.SparkSQLDataSink.writeDynamicFrame(DataSink.scala:123)
at com.amazonaws.services.glue.DataSink.pyWriteDynamicFrame(DataSink.scala:38)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 3109 tasks (1024.3 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:186)
... 45 more

You say you want to write the data out to S3. If you aren't doing any heavy aggregations (groupBy or join), the size of the DataFrame shouldn't matter for the write; moving the data just takes longer.

The error you're getting suggests that you're trying to stream the DataFrame back to the driver: spark.driver.maxResultSize is a configuration that controls how much data can be streamed from the executors back to the driver. That usually happens when DataFrame.collect or DataFrame.collectAsList is called. See the Spark documentation for more details.
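As a sketch only (and assuming the job script creates its own SparkContext, which is how Glue scripts are usually written), spark.driver.maxResultSize can be raised when the context is created; the "4g" value below is just an illustrative number, not a recommendation, and whether the Glue runtime honours a driver setting supplied this way may depend on the Glue version:

# Sketch: raise spark.driver.maxResultSize when the SparkContext is created.
# "4g" is an arbitrary example value; size it to the driver's available memory.
from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

conf = SparkConf().set("spark.driver.maxResultSize", "4g")
sc = SparkContext(conf=conf)  # only takes effect if no SparkContext exists yet
glueContext = GlueContext(sc)

Raising the limit only treats the symptom, though; if something really is collecting the whole dataset onto the driver, that call is what needs to change.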

You say you're running into all sorts of errors; maybe your job is raising an error somewhere else? Can you share all of your Glue code, and perhaps the error traces as well?

When checking the CloudWatch logs, make sure to change the view from rows to text.

The various errors I've come across relate to the fixes I've been trying. An additional example from the original error logs:

Traceback (most recent call last): File "script_2018-06-14-15-45-26.py", line 74, in <module> dataoutput_personal = glueContext.write_dynamic_frame.from_options(frame = mapped_datasource0_personal, connection_type = "s3", connection_options = {"path": glue_relationalize_output_s3_path_personal, "partitionKeys": ["eventDate"]}, format = "parquet", transformation_ctx = "dataoutput_personal")

Would removing the dataoutput_personal = assignment help me? Is it equivalent to a .collect()? Considering it's the last operation in the script, I don't actually use that dynamic frame afterwards.

I don't think write_dynamic_frame calls collect internally. Are you really not applying any transformations to the mapped_datasource0_personal dataframe?

mapped_datasource0_personal is converted from a PySpark DataFrame just before the output line in the code:

mapped_datasource0_personal = DynamicFrame.fromDF(mapped_personal, glueContext, "mapped_datasource0_personal")

Could writing the DataFrame I want to output to S3 into a new DynamicFrame be causing the problem? Thanks so much for your help, @botchniaque. It's worth mentioning that the job runs perfectly on a smaller sample dataset.

I have since looped my output over the date values in the data, so it is written in smaller chunks, and removed the dataoutput_personal = assignment since it does nothing functionally. It now fails with a new error; this time there is no traceback in the logs, and the error message in the AWS Glue UI is just: Command failed with exit code 1. Unfortunately that makes it hard to find anything useful. The loop I'm running is:
listOfDistinctsPersonal = mapped_personal.select("eventDate").distinct()

#LOOP WRITE PERSONAL
for eventDateParam in listOfDistinctsPersonal:
    partitioned_mapped_personal = mapped_personal.where(col("eventDate") == eventDateParam)
    partitioned_mapped_personal_DF = DynamicFrame.fromDF(partitioned_mapped_personal, glueContext, "partitioned_mapped_personal_DF")
    glueContext.write_dynamic_frame.from_options(frame = partitioned_mapped_personal_DF, connection_type = "s3", connection_options = {"path": glue_relationalize_output_s3_path_personal, "partitionKeys": ["eventDate"]}, format = "parquet", transformation_ctx = "dataoutput_personal")
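A minimal sketch of how the loop above could be made to run, assuming the same mapped_personal DataFrame, glueContext, and glue_relationalize_output_s3_path_personal used elsewhere in the script (distinct_dates_personal is a helper name introduced here). Looping directly over a PySpark DataFrame does not yield the eventDate values, so the sketch collects the distinct dates to the driver first, which is a tiny result well under maxResultSize, and unpacks each Row before filtering:

from pyspark.sql.functions import col
from awsglue.dynamicframe import DynamicFrame

# Collect only the distinct dates to the driver; this is a small list of values.
distinct_dates_personal = [row["eventDate"] for row in
                           mapped_personal.select("eventDate").distinct().collect()]

# Filter and write one date at a time so each write stays small.
for eventDateParam in distinct_dates_personal:
    partitioned_mapped_personal = mapped_personal.where(col("eventDate") == eventDateParam)
    partitioned_mapped_personal_DF = DynamicFrame.fromDF(
        partitioned_mapped_personal, glueContext, "partitioned_mapped_personal_DF")
    glueContext.write_dynamic_frame.from_options(
        frame = partitioned_mapped_personal_DF,
        connection_type = "s3",
        connection_options = {"path": glue_relationalize_output_s3_path_personal,
                              "partitionKeys": ["eventDate"]},
        format = "parquet",
        transformation_ctx = "dataoutput_personal")

Since "partitionKeys": ["eventDate"] already splits the S3 output by date, it is also worth checking whether the explicit loop is needed at all; the sketch simply mirrors the chunked approach discussed in the comments.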