Spark job freezes when launched in parallel from Scala
I want to convert a set of time-series data from multiple CSV files into LabeledPoints and save them to Parquet files. The CSV files are small, usually under 10 MB.

When I launch the conversion with a ParArray, Spark submits 4 jobs at once and then freezes. Here is the code:
val idx = Another_DataFrame
ListFiles(new File("data/stock data"))
  .filter(_.getName.contains(".csv")).zipWithIndex
  .par //comment this line and code runs smoothly
  .foreach { f =>
    val stk = spark_csv(f._1.getPath) //doing good
    ColMerge(stk, idx, RESULT_PATH(f)) //freeze here
    stk.unpersist()
  }
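One thing worth ruling out first: by default every parallel collection in the JVM shares the same global ForkJoin pool, so if the blocking Spark actions inside `foreach` occupy all of its worker threads, the whole pool can stall. A minimal sketch (assuming Scala 2.12, where `ForkJoinTaskSupport` wraps `java.util.concurrent.ForkJoinPool`) of giving the ParArray its own pool:

```scala
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

val files = (1 to 8).toArray.par
// Give this parallel collection its own pool instead of the shared global one,
// so blocking calls inside foreach/map (e.g. a Spark collect) cannot starve
// every other parallel operation in the JVM.
files.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(2))
val doubled = files.map(_ * 2)
```

Here the pool size 2 is an illustrative choice; in the snippet above it would bound how many CSV files are processed (and hence how many Spark jobs run) concurrently.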
The part that freezes:
def ColMerge(ori: DataFrame, index: DataFrame, PATH: String) = {
  val df = ori.join(index, ori("date") === index("index_date"))
    .drop("index_date").orderBy("date").cache
  val head = df.head
  val col = df.columns.filter(e => e != "code" && e != "date" && e != "name")
  val toMap = col.filter { e =>
    head.get(head.fieldIndex(e)).isInstanceOf[String]
  }.sorted
  val toCast = col.diff(toMap).filterNot(_ == "data")
  val res: Array[((String, String, Array[Double]), Long)] = df.sort("date").map { row =>
    val res1 = toCast.map { col =>
      row.getDouble(row.fieldIndex(col))
    }
    val res2 = toMap.flatMap { col =>
      val mapping = new Array[Double](GlobalConfig.ColumnMapping(col).size)
      row.getString(row.fieldIndex(col)).split(";").par.foreach { word =>
        mapping(GlobalConfig.ColumnMapping(col)(word)) = 1
      }
      mapping
    }
    (
      row.getString(row.fieldIndex("code")),
      row.getString(row.fieldIndex("date")),
      res1 ++ res2 ++ row.getAs[Seq[Double]]("data")
    )
  }.zipWithIndex.collect
  df.unpersist
  val dataset = GlobalConfig.sctx.makeRDD(res.map { day =>
    (day._1._1,
     day._1._2,
     try {
       new LabeledPoint(
         GetHighPrice(res(day._2.toInt + 2)._1._3.slice(0, 4)) /
           GetLowPrice(res(day._2.toInt)._1._3.slice(0, 4)) * 1.03,
         Vectors.dense(day._1._3))
     } catch {
       case ex: ArrayIndexOutOfBoundsException =>
         new LabeledPoint(-1, Vectors.dense(day._1._3))
     })
  }).filter(_._3.label != -1).toDF("code", "date", "labeledpoint")
  dataset.write.mode(SaveMode.Overwrite).parquet(PATH)
}
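A side note on the body of the `map` above: the inner `.par.foreach` over `split(";")` runs inside the Spark closure, so it executes on whatever ForkJoin pool is available in the executor and adds nested parallelism for a handful of words, which is more likely to cause contention than speedup. A sequential loop does the same one-hot encoding; here is a sketch, where `columnMapping` is a hypothetical stand-in for `GlobalConfig.ColumnMapping(col)`:

```scala
// Hypothetical word-to-index mapping, standing in for
// GlobalConfig.ColumnMapping(col) from the original code.
val columnMapping: Map[String, Int] = Map("buy" -> 0, "sell" -> 1, "hold" -> 2)

def oneHot(cell: String): Array[Double] = {
  val mapping = new Array[Double](columnMapping.size)
  // Sequential loop: no .par needed for a few tokens, and it avoids
  // nested parallelism inside a Spark task.
  cell.split(";").foreach { word => mapping(columnMapping(word)) = 1.0 }
  mapping
}
```

For example, `oneHot("buy;hold")` sets positions 0 and 2 of a length-3 array.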
When generating res in ColMerge, the exact job that freezes is DataFrame.sort() or zipWithIndex.

Since most of the work is done after collect, I would really like to use a ParArray to speed up ColMerge, but this strange freeze prevents me from doing so. Do I need to create a new thread pool to do this?

Comments:
"Why does every thread write to the same path? Is that intended?" — @PankajArora The path is generated for each CSV file. I use RESULT_PATH(f) because the original expression was too long.
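On the thread-pool question: submitting concurrent Spark jobs is supported (the scheduler serializes or interleaves them depending on `spark.scheduler.mode`), and a dedicated fixed-size pool with `Future`s gives more control than `.par` because the blocking `collect` calls then cannot exhaust the global ForkJoin pool. A minimal sketch, not the original code (`submitAll` and the pool size 4 are illustrative assumptions):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Dedicated pool sized to the number of Spark jobs we want in flight at once.
// Blocking actions (collect, write.parquet) tie up only these threads.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

def submitAll[A, B](items: Seq[A])(job: A => B): Seq[B] = {
  val futures = items.map(item => Future(job(item))) // start all jobs
  futures.map(f => Await.result(f, Duration.Inf))    // then wait for each
}
```

In the question's setting, `items` would be the CSV files and `job` the `spark_csv` + `ColMerge` pipeline; remember to shut the underlying executor service down when the driver is done.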