Scala Spark SQL: delay between jobs (broadcast DataFrames)


scala apache-spark apache-spark-sql google-cloud-dataproc


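I have an application that processes 8 DataFrames in parallel iterations. The job ran fine until I joined these DataFrames with two very small DataFrames (less than 1 KB each) produced by reading CSV files. After adding the joins, the application's execution time increased substantially (by more than 100%). In the Spark Web UI I found some jobs listed as running at ThreadPoolExecutor.java:1149. These jobs are responsible for broadcasting the very small DataFrames. For each parallel execution (16 of them), these jobs run once per small DataFrame. Each ThreadPoolExecutor.java:1149 execution block adds a delay of about 4 minutes. The problem gets worse in the same proportion whenever I add another small DataFrame to be joined with the 8 parallel DataFrames.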

To parallelize the DataFrames, I created a List[DataFrame].par.
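A minimal sketch of that pattern, not the original code: the question does not show its driver loop, so the object name `ParallelRun`, the helper `runAll`, and the `process` function are hypothetical. It assumes Scala 2.12, where `.par` is built in; on Scala 2.13 it would need the scala-parallel-collections module and `import scala.collection.parallel.CollectionConverters._`.

```scala
object ParallelRun {
  // Applies `process` to every item on the parallel collection's thread pool
  // and returns the results in the original order.
  // With A = DataFrame, each call that triggers a Spark action submits its
  // jobs from a separate driver thread, so the jobs can run concurrently.
  def runAll[A, B](items: List[A])(process: A => B): List[B] =
    items.par.map(process).toList
}
```

With a `List[DataFrame]` this could be called as `ParallelRun.runAll(dataFrames)(df => df.count())`, where `dataFrames` stands for the 8 DataFrames from the question.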

Here is the object that reads the CSV files and builds the DataFrames:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

// CSV files loaded by this object, e.g.
//   s"${dataBaseParameters.viewConfigurationDir}view_classification_by_path_clean.csv"
//   view_classification_by_subtype.csv
//   view_classification_by_type.csv

object ClassificationOverwriteLoader {

 // Reads a semicolon-delimited CSV file with a header row, applying the given schema.
 def loadViewClassification(spark: SparkSession, schema: StructType, filePath: String): DataFrame = {
   spark.read
     .format("csv")
     .option("header", "true")
     .option("delimiter", ";")
     .schema(schema)
     .load(filePath)
 }

 val ByPathSchema = StructType(
   Seq(
     StructField("path_clean", StringType,true),
     StructField("group_override_by_path", StringType, true),
     StructField("type_override_by_path", StringType, true),
     StructField("subtype_override_by_path", StringType, true),
     StructField("is_active", BooleanType,true),
     StructField("row_created_date", DateType,true),
     StructField("row_updated_date", DateType,true),
     StructField("row_created_by", StringType,true),
     StructField("row_updated_by", StringType,true)
   )
 )

 val ByTypeSubTypeSchema = StructType(
   Seq(
     StructField("group_override", StringType, true),
     StructField("type_source", StringType, true),
     StructField("type_override", StringType, true),
     StructField("subtype_source", StringType, true),
     StructField("subtype_override", StringType, true),
     StructField("is_active", BooleanType,true),
     StructField("row_created_date", DateType,true),
     StructField("row_updated_date", DateType,true),
     StructField("row_created_by", StringType,true),
     StructField("row_updated_by", StringType,true)
   )
 )

}
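The joins themselves are not shown in the question. As a rough sketch of what joining against one of these small lookups might look like, with an explicit broadcast hint: the object name `LookupJoinSketch`, the helper `joinByPath`, the left join type, and the presence of a `path_clean` column on the main DataFrame are all assumptions, since the question only shows the lookup schemas.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

object LookupJoinSketch {
  // Left-joins `views` with the small by-path lookup on path_clean and
  // marks the lookup side for a broadcast join.
  // Assumption: `views` has a path_clean column (not shown in the question).
  def joinByPath(views: DataFrame, byPath: DataFrame): DataFrame =
    views.join(broadcast(byPath), Seq("path_clean"), "left")
}
```

If the lookup is loaded once outside the parallel loop and marked with `.cache()`, the CSV file may not need to be re-read for every iteration. Whether that shortens the broadcast jobs here is untested.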

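I run this job on a Google Dataproc cluster, and all the data (including the CSV files) is stored on the Google file system. Is there a way to avoid this scheduling delay?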

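It looks like the delay between jobs is not the cause but rather a consequence of introducing the joins. You will probably need to dig deeper to find the real bottleneck.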
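One possible way to start digging, offered as a sketch rather than a confirmed fix: check whether a join is actually planned as a broadcast join, and compare run times with automatic broadcasting turned off. The object name `BroadcastDiagnostics` and its helpers are hypothetical; `spark.sql.autoBroadcastJoinThreshold` is a real Spark SQL setting, and `-1` disables automatic broadcast joins.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object BroadcastDiagnostics {
  private val ThresholdKey = "spark.sql.autoBroadcastJoinThreshold"

  // True when the physical plan of `df` mentions a broadcast hash join.
  def usesBroadcastJoin(df: DataFrame): Boolean =
    df.queryExecution.executedPlan.toString.contains("BroadcastHashJoin")

  // Runs `body` with automatic broadcast joins disabled, then restores the
  // previous threshold. The join must be built inside `body`, because a
  // DataFrame keeps the plan it was first planned with.
  // The setting is session-wide, so this is meant for a single-threaded
  // experiment, not for use inside the parallel loop.
  def withoutAutoBroadcast[T](spark: SparkSession)(body: => T): T = {
    val previous = spark.conf.get(ThresholdKey)
    spark.conf.set(ThresholdKey, "-1")
    try body
    finally spark.conf.set(ThresholdKey, previous)
  }
}
```

Timing the same join with and without automatic broadcasting should show whether the broadcast step itself accounts for the extra minutes or whether the time goes elsewhere, for example into reading the CSV files.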