Apache Spark: why does my Spark save job have 4 stages?


I am trying to save a DataFrame to an HDFS location, but the save is taking a very long time. The operation right before it is a join of two tables using Spark SQL. I need to know why the save is split into four stages and how to improve its performance. I have attached the list of stages here, along with my code snippet.

Spark code:

This function receives its input from the main class; the models variable carries the table metadata read from XML. It first fetches the data of the source table and then tries to retrieve data from the other joined tables.

    def sourceGen(spark: SparkSession,
                minBatchLdNbr: Int,
                maxBatchLdNbr: Int,
                batchLdNbrList: String,
                models: (GModel, TModel, NModel)): Unit = {
    val configJson = models._3
    val gblJson = models._1
    println("Source Loading started")
    val sourceColumns = configJson.transformationJob.sourceDetails.sourceSchema
    val query = new StringBuilder("select ")
    sourceColumns.map { SrcColumn =>
      if (SrcColumn.isKey == "nak") {
        query.append(
          "cast(" + SrcColumn.columnExpression + " as " + SrcColumn.columnDataType + ") as " + SrcColumn.columnName + ",")
      }
    }
    var tableQuery: String =
      if (!configJson.transformationJob.sourceDetails.sourceTableSchemaName.isEmpty) {
        if (!batchLdNbrList.trim.isEmpty)
          query.dropRight(1) + " from " + configJson.transformationJob.sourceDetails.sourceTableSchemaName + "." + configJson.transformationJob.sourceDetails.sourceTableName + " where batch_ld_nbr > " + minBatchLdNbr + " and batch_ld_nbr <= " + maxBatchLdNbr + " or batch_ld_nbr in ( " + batchLdNbrList + " )"
        else
          query.dropRight(1) + " from " + configJson.transformationJob.sourceDetails.sourceTableSchemaName + "." + configJson.transformationJob.sourceDetails.sourceTableName + " where batch_ld_nbr > " + minBatchLdNbr + " and batch_ld_nbr <= " + maxBatchLdNbr
      } else {
        if (!batchLdNbrList.trim.isEmpty)
          query.dropRight(1) + " from " + gblJson.gParams.sourceTableSchemaName + "." + configJson.transformationJob.sourceDetails.sourceTableName + " where batch_ld_nbr > " + minBatchLdNbr + " and batch_ld_nbr <= " + maxBatchLdNbr + " or batch_ld_nbr in ( " + batchLdNbrList + " )"
        else
          query.dropRight(1) + " from " + gblJson.gParams.sourceTableSchemaName + "." + configJson.transformationJob.sourceDetails.sourceTableName + " where batch_ld_nbr > " + minBatchLdNbr + " and batch_ld_nbr <= " + maxBatchLdNbr
      }
    if (minBatchLdNbr == 0 && maxBatchLdNbr == 0) {
      tableQuery = tableQuery.split("where")(0)
    }
    println("Time"+LocalDateTime.now());
    val tableQueryDf: DataFrame = spark.sql(tableQuery)
    println("tableQueryDf"+tableQueryDf);
    println("Time"+LocalDateTime.now());
    println("Source Loading ended")
    println("Parent Loading Started")
    val parentColumns = configJson.transformationJob.sourceDetails.parentTables
    val parentSourceJoinDF: DataFrame = if (!parentColumns.isEmpty) {
      parentChildJoin(tableQueryDf,
                      parentColumns,
                      spark,
                      gblJson.gParams.pSchemaName)
    } else {
      tableQueryDf
    }
    println("tableQueryDf"+tableQueryDf);
    println("Parent Loading ended")
    println("Key Column Generation Started")
    println("Time"+LocalDateTime.now());
    val arrOfCustomExprs = sourceColumns
      .filter(_.isKey.toString != "nak")
      .map(
        f =>
          functions
            .expr(f.columnExpression)
            .as(f.columnName)
            .cast(f.columnDataType))
    val colWithExpr = parentSourceJoinDF.columns.map(f =>
      parentSourceJoinDF.col(f)) ++ arrOfCustomExprs
    val finalQueryDF = parentSourceJoinDF.select(colWithExpr: _*)
    println("finalQueryDF"+finalQueryDF);
    println("Time"+LocalDateTime.now());
    keyGenUtils.writeParquetTemp(
      finalQueryDF,
      configJson.transformationJob.globalParams.hdfsInterimPath + configJson.transformationJob.sourceDetails.sourceTableName + "/temp_" + configJson.transformationJob.sourceDetails.sourceTableName
    )
    println("PrintedTime"+LocalDateTime.now());
    println("Key Column Generation Ended")
  }
Spark submit configuration:

    /usr/hdp/2.6.3.0-235/spark2/bin//spark-submit --master yarn --deploy-mode client --driver-memory 30G --executor-memory 25G --executor-cores 6 --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.sql.autoBroadcastJoinThreshold=774857600 --conf spark.kryoserializer.buffer.max.mb=512 --conf spark.dynamicAllocation.maxExecutors=40 --conf spark.eventLog.enabled=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.parquet.binaryAsString=true  --conf spark.sql.broadcastTimeout=36000 --conf spark.sql.shuffle.partitions=500

From the attached image it looks like you are doing a repartition of dataset one (1277 partitions).

Stage 10 -> reads dataset one and creates its stage boundary; the compressed dataset produces 1277 tasks (total number of files * blocks per file, approximated to the maximum available cores) => shuffle write

Stage 11 -> shuffles that dataset into the default spark.sql.shuffle.partitions (set to 500)

Stage 12 -> reads the second dataset and creates its stage boundary; it produces 348 tasks (total number of files * blocks per file, approximated to the maximum available cores) => shuffle write

Stage 13 -> joins the two datasets and saves the result to HDFS with the default spark.sql.shuffle.partitions (set to 500); see the sketch below for how these boundaries show up in the physical plan.
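To see where those boundaries come from, you can print the physical plan of the join. This is only a sketch, with hypothetical table names src and dim standing in for the real tables; every Exchange operator in the output corresponds to one shuffle write/read boundary, i.e. one of the stage splits above.

    // Sketch only, not the code from the question: "src" and "dim" are
    // hypothetical stand-ins for the two joined tables.
    val joined = spark.sql(
      "select s.*, d.dim_col from src s join dim d on s.key = d.key")

    // Each "Exchange hashpartitioning(..., 500)" node in the printed plan is one
    // shuffle, i.e. one stage boundary; the 500 comes from spark.sql.shuffle.partitions.
    joined.explain()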


Going through the code we can see the crux of the problem: the final join has a huge Shuffle Read, so you may want to lower the default shuffle partitions.
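A minimal sketch of that tuning, assuming it is applied before the join and save execute; the value 200 below is just a placeholder and would need to be tuned against the shuffle read size shown in the Spark UI.

    // Sketch only: lower the shuffle parallelism for this session before the
    // join/save runs. 200 is an arbitrary placeholder, not a recommendation;
    // the same setting can be passed as --conf spark.sql.shuffle.partitions=200.
    spark.conf.set("spark.sql.shuffle.partitions", "200")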

You need to actually post the code, not just the spark-submit statement, for anyone to be able to help you. Added the code:
    def writeParquetTemp(df: DataFrame, hdfsPath: String): Unit = {
      df.write.format("parquet").option("compression", "none").mode(SaveMode.Overwrite).save(hdfsPath)
    }
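If the write itself is the slow part, one illustrative variant of this helper (a sketch, not the code actually used in the job) coalesces the DataFrame before saving so the final stage runs fewer, larger write tasks; numPartitions is a hypothetical parameter.

    // Hypothetical variant, for illustration only: coalesce so the save stage
    // writes fewer, larger parquet files instead of one file per shuffle partition.
    def writeParquetTempCoalesced(df: DataFrame, hdfsPath: String, numPartitions: Int): Unit = {
      df.coalesce(numPartitions)
        .write
        .format("parquet")
        .option("compression", "none")
        .mode(SaveMode.Overwrite)
        .save(hdfsPath)
    }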