Non-descriptive Spark error during a Delta MERGE on Azure

Tags: azure, apache-spark, databricks, azure-databricks, delta-lake

I'm using Spark 3.1 on Databricks (Databricks Runtime 8) with a very large cluster (25 workers, each with 112 GB of memory and 16 cores) to replicate several SAP tables into Azure Data Lake Storage (ADLS Gen2). To do this, a tool writes the deltas of all these tables into an intermediate system (SQL Server), and then, whenever I have new data for a given table, I run a Databricks job that merges the new data with the existing data already available in ADLS.
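For context, the ingestion side of the job reads a table's staged increment from the intermediate SQL Server over JDBC. This is only a minimal sketch; the server, database, staging table, and credentials (staging-server, staging_db, dbo.SAP_TABLE_DELTA, etl_user) are hypothetical placeholders, not the real names:

```python
# Minimal sketch of reading one table's increment from the intermediate
# SQL Server; all connection details below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

delta_updates = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://staging-server:1433;database=staging_db")
    .option("dbtable", "dbo.SAP_TABLE_DELTA")  # staging table with the new rows
    .option("user", "etl_user")
    .option("password", "...")
    .load()
)
```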

This process works fine for most tables, but some of them (the largest ones) take a very long time to merge (I merge the data on each table's primary key), and the largest one has been failing since a week ago, when a big delta was generated for that table.
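For reference, the merge itself is essentially an upsert on the table's primary key. The following is a minimal sketch using the Delta Lake Python API; the ADLS path and the key column name (PK) are hypothetical placeholders, and delta_updates is the DataFrame read from the staging system in the previous sketch:

```python
# Minimal sketch of the per-table merge; the ADLS path and the key
# column "PK" are hypothetical placeholders for the real values.
from delta.tables import DeltaTable

target = DeltaTable.forPath(
    spark, "abfss://container@account.dfs.core.windows.net/sap/SAP_TABLE"
)

# Upsert the staged increment into the existing Delta table on the PK.
(
    target.alias("t")
    .merge(delta_updates.alias("s"), "t.PK = s.PK")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

This is the error trace I can see in the job: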

Py4JJavaError: An error occurred while calling o233.sql.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:234)
    at com.databricks.sql.transaction.tahoe.files.TransactionalWriteEdge.$anonfun$writeFiles$5(TransactionalWriteEdge.scala:246)
    ...
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:428)
    at com.databricks.sql.transaction.tahoe.perf.DeltaOptimizedWriterExec.awaitShuffleMapStage$1(DeltaOptimizedWriterExec.scala:153)
    at com.databricks.sql.transaction.tahoe.perf.DeltaOptimizedWriterExec.getShuffleStats(DeltaOptimizedWriterExec.scala:158)
    at com.databricks.sql.transaction.tahoe.perf.DeltaOptimizedWriterExec.computeBins(DeltaOptimizedWriterExec.scala:106)
    at com.databricks.sql.transaction.tahoe.perf.DeltaOptimizedWriterExec.doExecute(DeltaOptimizedWriterExec.scala:174)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:196)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:240)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:236)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:192)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:180)
    ... 141 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 68 (execute at DeltaOptimizedWriterExec.scala:97) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Connection from /XXX.XX.XX.XX:4048 closed
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:769)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:684)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:69)
    ...
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connection from /XXX.XX.XX.XX:4048 closed
    at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
    at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:117)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
    at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
    at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:225)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
    at io.netty.util.concurrent