Apache Spark: why does my Spark save job have 4 stages?

I am trying to save a DataFrame to an HDFS location, but the save takes a very long time. The operation immediately before it joins two tables using Spark SQL. I need to know why the save is split into four stages and how to improve performance. I have attached the stage list and my code snippet below.

Spark code: this function receives its input from the main class; the model variables carry table metadata read from XML. It first fetches the source table's data and then retrieves data from the other joined tables.
def sourceGen(spark: SparkSession,
              minBatchLdNbr: Int,
              maxBatchLdNbr: Int,
              batchLdNbrList: String,
              models: (GModel, TModel, NModel)): Unit = {
  val configJson = models._3
  val gblJson = models._1
  println("Source Loading started")
  val sourceColumns = configJson.transformationJob.sourceDetails.sourceSchema
  val query = new StringBuilder("select ")
  // foreach (not map): the loop runs only for its side effect on `query`
  sourceColumns.foreach { srcColumn =>
    if (srcColumn.isKey == "nak") {
      query.append(
        "cast(" + srcColumn.columnExpression + " as " + srcColumn.columnDataType +
          ") as " + srcColumn.columnName + ",")
    }
  }
  // Note the space before "or": without it the generated SQL is invalid,
  // e.g. "... <= 5or batch_ld_nbr in (...)".
  var tableQuery: String =
    if (!configJson.transformationJob.sourceDetails.sourceTableSchemaName.isEmpty) {
      if (!batchLdNbrList.trim.isEmpty)
        query.dropRight(1) + " from " + configJson.transformationJob.sourceDetails.sourceTableSchemaName + "." +
          configJson.transformationJob.sourceDetails.sourceTableName +
          " where batch_ld_nbr > " + minBatchLdNbr + " and batch_ld_nbr <= " + maxBatchLdNbr +
          " or batch_ld_nbr in ( " + batchLdNbrList + " )"
      else
        query.dropRight(1) + " from " + configJson.transformationJob.sourceDetails.sourceTableSchemaName + "." +
          configJson.transformationJob.sourceDetails.sourceTableName +
          " where batch_ld_nbr > " + minBatchLdNbr + " and batch_ld_nbr <= " + maxBatchLdNbr
    } else {
      if (!batchLdNbrList.trim.isEmpty)
        query.dropRight(1) + " from " + gblJson.gParams.sourceTableSchemaName + "." +
          configJson.transformationJob.sourceDetails.sourceTableName +
          " where batch_ld_nbr > " + minBatchLdNbr + " and batch_ld_nbr <= " + maxBatchLdNbr +
          " or batch_ld_nbr in ( " + batchLdNbrList + " )"
      else
        query.dropRight(1) + " from " + gblJson.gParams.sourceTableSchemaName + "." +
          configJson.transformationJob.sourceDetails.sourceTableName +
          " where batch_ld_nbr > " + minBatchLdNbr + " and batch_ld_nbr <= " + maxBatchLdNbr
    }
  if (minBatchLdNbr == 0 && maxBatchLdNbr == 0) {
    tableQuery = tableQuery.split("where")(0)
  }
  println("Time: " + LocalDateTime.now())
  val tableQueryDf: DataFrame = spark.sql(tableQuery)
  println("tableQueryDf: " + tableQueryDf)
  println("Time: " + LocalDateTime.now())
  println("Source Loading ended")
  println("Parent Loading started")
  val parentColumns = configJson.transformationJob.sourceDetails.parentTables
  val parentSourceJoinDF: DataFrame =
    if (!parentColumns.isEmpty) {
      parentChildJoin(tableQueryDf, parentColumns, spark, gblJson.gParams.pSchemaName)
    } else {
      tableQueryDf
    }
  println("parentSourceJoinDF: " + parentSourceJoinDF)
  println("Parent Loading ended")
  println("Key Column Generation started")
  println("Time: " + LocalDateTime.now())
  val arrOfCustomExprs = sourceColumns
    .filter(_.isKey != "nak")
    .map(f =>
      functions
        .expr(f.columnExpression)
        .as(f.columnName)
        .cast(f.columnDataType))
  val colWithExpr = parentSourceJoinDF.columns.map(f => parentSourceJoinDF.col(f)) ++ arrOfCustomExprs
  val finalQueryDF = parentSourceJoinDF.select(colWithExpr: _*)
  println("finalQueryDF: " + finalQueryDF)
  println("Time: " + LocalDateTime.now())
  keyGenUtils.writeParquetTemp(
    finalQueryDF,
    configJson.transformationJob.globalParams.hdfsInterimPath +
      configJson.transformationJob.sourceDetails.sourceTableName + "/temp_" +
      configJson.transformationJob.sourceDetails.sourceTableName
  )
  println("PrintedTime: " + LocalDateTime.now())
  println("Key Column Generation ended")
}
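As an aside, the generated WHERE clause mixes `and`/`or` without parentheses, so `batch_ld_nbr > min and batch_ld_nbr <= max or batch_ld_nbr in (...)` is parsed by SQL precedence as `(A and B) or C`. If that grouping is intentional, making it explicit avoids surprises. A minimal sketch; `buildBatchPredicate` is a hypothetical helper, not part of the original job:

```scala
// Sketch: build the batch predicate with explicit parentheses so the
// intended grouping "(range) or (in-list)" is visible in the SQL text.
def buildBatchPredicate(minBatchLdNbr: Int,
                        maxBatchLdNbr: Int,
                        batchLdNbrList: String): String = {
  val range = s"(batch_ld_nbr > $minBatchLdNbr and batch_ld_nbr <= $maxBatchLdNbr)"
  if (batchLdNbrList.trim.isEmpty) s" where $range"
  else s" where ($range or batch_ld_nbr in ( $batchLdNbrList ))"
}
```

This also removes the four near-duplicate string-concatenation branches from the query builder.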
Spark submit configuration:
/usr/hdp/2.6.3.0-235/spark2/bin//spark-submit --master yarn --deploy-mode client --driver-memory 30G --executor-memory 25G --executor-cores 6 --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.sql.autoBroadcastJoinThreshold=774857600 --conf spark.kryoserializer.buffer.max.mb=512 --conf spark.dynamicAllocation.maxExecutors=40 --conf spark.eventLog.enabled=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.parquet.binaryAsString=true --conf spark.sql.broadcastTimeout=36000 --conf spark.sql.shuffle.partitions=500
From the attached images it looks like you are performing a repartition on dataset 1 (1277 partitions):

Stage 10 -> reads the "compressed" dataset and sets a stage boundary; tasks ≈ 1277 files * blocks per file (capped near the maximum available cores) => shuffle write
Stage 11 -> shuffles this dataset down to the default spark.sql.shuffle.partitions (set to 500)
Stage 12 -> reads the second dataset and sets a stage boundary; tasks ≈ 348 files * blocks per file (capped near the maximum available cores) => shuffle write
Stage 13 -> joins the two datasets at the default spark.sql.shuffle.partitions (set to 500) and saves the result in HDFS
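If the join and save stages are dominated by the 500 default shuffle partitions, the setting can be tuned per job. A sketch under stated assumptions — the table names and the value 200 are illustrative, not taken from the original job:

```scala
// Sketch: lower spark.sql.shuffle.partitions before the join/save so the
// final stages run fewer, larger tasks. Tune the value to your data volume
// (a common rule of thumb is on the order of 128 MB per shuffle partition).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tuning-sketch").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "200")

// Any join and subsequent save executed after this point uses 200 shuffle
// partitions instead of the configured default of 500.
val joined = spark.table("src").join(spark.table("parent"), Seq("batch_ld_nbr"))
joined.write.mode("overwrite").parquet("/tmp/illustrative_path")
```

Setting it via `spark.conf.set` at runtime has the same effect as the `--conf spark.sql.shuffle.partitions=...` flag, but can be changed between steps of the same job.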
From the code we can see the crux of the problem, but the final join has a huge Shuffle Read, so you may want to lower the default shuffle partitions.
You need to actually post the code, not just the spark-submit statement, for anyone to help you. — Added the code.
def writeParquetTemp(df: DataFrame, hdfsPath: String): Unit = {
  df.write.format("parquet").option("compression", "none").mode(SaveMode.Overwrite).save(hdfsPath)
}
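Independently of the shuffle tuning, the number of parquet files written by a save like this equals the number of partitions of the incoming DataFrame. Coalescing before the write is one way to cap that. A sketch, assuming the same signature; the name `writeParquetTempCoalesced` and the value 100 are illustrative:

```scala
// Sketch: cap the number of parquet output files by coalescing first.
// coalesce narrows partitions without a full shuffle (unlike repartition),
// so it is cheap; but too few partitions can underuse the cluster during
// the write itself.
import org.apache.spark.sql.{DataFrame, SaveMode}

def writeParquetTempCoalesced(df: DataFrame, hdfsPath: String, numFiles: Int = 100): Unit = {
  df.coalesce(numFiles)
    .write
    .format("parquet")
    .option("compression", "none")
    .mode(SaveMode.Overwrite)
    .save(hdfsPath)
}
```

Because coalesce merges existing partitions rather than redistributing rows, it can also shrink the upstream stage's parallelism; if that hurts the join, prefer tuning spark.sql.shuffle.partitions instead.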