Scala Spark SQL freezes


I have a problem with Spark SQL. I read some data from a CSV file, then I do a groupBy and a join operation, and the finishing task is writing the joined data to a file. My problem is the time gap, marked by the blank line in the logs below.

18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1069
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1003
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 965
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1073
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1038
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 900
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 903
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 938
18/08/07 23:39:40 INFO storage.BlockManagerInfo: Removed broadcast_84_piece0 on 10.4.110.24:36423 in memory (size: 32.8 KB, free: 4.1 GB)
18/08/07 23:39:40 INFO storage.BlockManagerInfo: Removed broadcast_84_piece0 on omm104.in.nawras.com.om:43133 in memory (size: 32.8 KB, free: 4.1 GB)
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 969
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1036
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 970
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1006
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1039
18/08/07 23:39:47 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
18/08/07 23:39:54 INFO parquet.ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter

18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with: 
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Post-Scan Filters: 
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Output Data Schema: struct<_c0: string, _c1: string, _c2: string, _c3: string, _c4: string ... 802 more fields>
18/08/08 00:14:22 INFO execution.FileSourceScanExec: Pushed Filters: 
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with: 
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Post-Scan Filters: 
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Output Data Schema: struct<_c0: string, _c1: string, _c2: string, _c3: string, _c4: string ... 802 more fields>
18/08/08 00:14:22 INFO execution.FileSourceScanExec: Pushed Filters: 
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with: 
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Post-Scan Filters: 
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Output Data Schema: struct<_c0: string, _c1: string, _c2: string, _c3: string, _c4: string ... 802 more fields>
18/08/08 00:14:22 INFO execution.FileSourceScanExec: Pushed Filters: 
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with: 
The whole processing of 30 files (~85K records) takes an abnormally long ~38 minutes.
Have you ever seen a similar problem?

Try to avoid repartition calls, as they cause unnecessary data movement between the nodes.

According to Learning Spark:

Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.
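
As a minimal sketch of that difference (the app name, input path, and partition counts below are placeholders, not taken from the question):

import org.apache.spark.sql.SparkSession

object RepartitionVsCoalesce {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("repartition-vs-coalesce").getOrCreate()

    // Hypothetical input path, only for illustration.
    val df = spark.read.csv("/path/to/input")

    // repartition(n) always performs a full shuffle, whether n is larger or smaller
    // than the current partition count.
    val repartitioned = df.repartition(200)

    // coalesce(n) only merges existing partitions (a narrow dependency),
    // so it avoids a full shuffle, but only when decreasing the partition count.
    val coalesced = df.coalesce(10)

    println(s"repartitioned: ${repartitioned.rdd.getNumPartitions} partitions")
    println(s"coalesced: ${coalesced.rdd.getNumPartitions} partitions")

    spark.stop()
  }
}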


Put simply, COALESCE is only for decreasing the number of partitions, with no shuffling of data; it just compresses the partitions.

That is not entirely accurate. Coalesce minimizes data shuffling because it is used to decrease the number of partitions, while repartition can either decrease or increase the number of partitions. It does not eliminate the shuffle completely, though; it also depends on whether the shuffle flag is set to true or false.
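The shuffle flag mentioned above lives on the RDD-level coalesce. A small sketch, assuming a locally parallelized RDD (all numbers are arbitrary), of how the flag changes the behaviour:

import org.apache.spark.sql.SparkSession

object CoalesceShuffleFlag {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("coalesce-shuffle-flag").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 100)

    // shuffle = false (the default): partitions are merged locally, no full shuffle.
    val merged = rdd.coalesce(10)

    // shuffle = true: coalesce inserts a shuffle step and behaves like repartition;
    // it is also the only way coalesce can increase the number of partitions.
    val reshuffled = rdd.coalesce(200, shuffle = true)

    println(s"merged: ${merged.getNumPartitions}, reshuffled: ${reshuffled.getNumPartitions}")
    spark.stop()
  }
}
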
import org.apache.spark.sql.functions.{col, collect_list, struct}

val parentDF = ...
val childADF = ...
val childBDF = ...

val aggregatedAColName = "CHILD_A"
val aggregatedBColName = "CHILD_B"

val columns = List("key_col_0", "key_col_1", "key_col_2", "key_col_3", "val_0")
val keyColumns = List("key_col_0", "key_col_1", "key_col_2", "key_col_3")

// Nest the selected columns of child A into a struct, group by the key columns,
// collect the structs into a list, then left-join the result onto the parent.
val nestedAColumns = keyColumns.map(x => col(x)) :+ struct(columns.map(col): _*).alias(aggregatedAColName)
val childADataFrame = childADF
  .select(nestedAColumns: _*)
  .repartition(keyColumns.map(col): _*)
  .groupBy(keyColumns.map(col): _*)
  .agg(collect_list(aggregatedAColName).alias(aggregatedAColName))
val joinedWithA = parentDF.join(childADataFrame, keyColumns, "left")

// Same aggregation for child B, joined onto the result of the first join.
val nestedBColumns = keyColumns.map(x => col(x)) :+ struct(columns.map(col): _*).alias(aggregatedBColName)
val childBDataFrame = childBDF
  .select(nestedBColumns: _*)
  .repartition(keyColumns.map(col): _*)
  .groupBy(keyColumns.map(col): _*)
  .agg(collect_list(aggregatedBColName).alias(aggregatedBColName))
val joinedWithB = joinedWithA.join(childBDataFrame, keyColumns, "left")
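
Applied to the code above, the advice mostly means dropping the explicit repartition before the groupBy: grouping on the same key columns already triggers a shuffle by those keys, so the separate repartition step is generally unnecessary. A hedged sketch, reusing the same (placeholder) DataFrames and column names from the snippet above:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Helper reusing the names defined above; parentDF/childADF/childBDF
// are still the same placeholder DataFrames.
def aggregateChild(childDF: DataFrame, aggregatedColName: String): DataFrame = {
  val nestedColumns = keyColumns.map(col) :+ struct(columns.map(col): _*).alias(aggregatedColName)
  childDF
    .select(nestedColumns: _*)
    // no explicit repartition: groupBy shuffles by the key columns on its own
    .groupBy(keyColumns.map(col): _*)
    .agg(collect_list(aggregatedColName).alias(aggregatedColName))
}

val joinedWithA = parentDF.join(aggregateChild(childADF, aggregatedAColName), keyColumns, "left")
val joinedWithB = joinedWithA.join(aggregateChild(childBDF, aggregatedBColName), keyColumns, "left")

If the repartition was meant to control the number of output files, a coalesce on the final result just before the write would be the cheaper option, per the quote above.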