Scala Spark 2.1: Out of memory when writing a DataFrame to a Parquet file?

I am trying to write a DataFrame (~14 million rows) to a local Parquet file, but I keep running out of memory while doing so. I have a large map, val myMap: Map[String, Seq[Double]], which I use inside a UDF applied to a very large DataFrame via val newDF = df.withColumn("stuff", udfWithMap). I have 128 GB of RAM available, and after persisting the DataFrame with StorageLevel.DISK_ONLY and running df.show, I still have roughly 100 GB free. However, when I try df.write.parquet, Spark's memory usage balloons and I run out of memory. I have also tried broadcasting myMap, but that seemed to have no effect on memory usage. What is going wrong?

Here is a sample of my code:

scala> type LookupMapSeq = (String, Seq[Double])

scala> val myMap = sc.objectFile[LookupMapSeq]("file:///data/dir/myMap").collectAsMap()

/* myMap.size is about 150,000 and each value Seq[Double] has length 200 */
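For scale, the raw payload of a map that size can be estimated directly. This is only a back-of-the-envelope lower bound; actual JVM usage is several times higher because Seq[Double] boxes each element and adds per-object headers:

```scala
// Rough lower bound for myMap's payload:
// 150,000 keys, each mapping to 200 doubles at 8 bytes apiece.
val entries = 150000
val vecLen = 200
val rawBytes = entries.toLong * vecLen * 8L

println(f"${rawBytes / (1024.0 * 1024.0)}%.0f MB raw")  // about 229 MB before JVM overhead
```

So the map itself is modest next to 128 GB of RAM; the blow-up at write time has to come from somewhere else, such as per-task copies of the closure or the cached rows being materialized.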

scala> import org.apache.spark.sql.functions

scala> val combineudf = functions.udf[Seq[Double], Seq[String]] { v1 =>
  // Look up each word's 200-dimensional vector (zero vector for unknown words),
  // then sum the vectors element-wise.
  val wordVec = v1.map(y => myMap.getOrElse(y, Seq.fill(200)(0.0)))
  wordVec.foldLeft(Seq.fill(200)(0.0)) { case (acc, list) =>
    acc.zipWithIndex.map { case (value, i) => value + list(i) }
  }
}
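As an aside, list(i) is a positional lookup, which is linear-time when the underlying Seq is a List (as Seq.fill produces), making the fold quadratic in the vector length; zipping the accumulator with each vector keeps the reduction linear. A minimal, self-contained sketch of the same element-wise sum (the helper name sumVectors is mine, not from the session):

```scala
// Element-wise sum of equal-length vectors, as the UDF body computes,
// but using zip instead of positional indexing.
def sumVectors(vecs: Seq[Seq[Double]], dim: Int): Seq[Double] =
  vecs.foldLeft(Seq.fill(dim)(0.0)) { (acc, vec) =>
    acc.zip(vec).map { case (a, b) => a + b }
  }

val result = sumVectors(Seq(Seq(1.0, 2.0), Seq(3.0, 4.0)), 2)
// result == Seq(4.0, 6.0)
```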

scala> import org.apache.spark.storage.StorageLevel

scala> val df6 = df3.withColumn("sum", combineudf(df3("filtered"))).persist(StorageLevel.DISK_ONLY)

scala> df6.show

+--------+--------------------+--------------------+--------------------+
|    pmid|            filtered|            TFIDFvec|                 sum|
+--------+--------------------+--------------------+--------------------+
|25393341|[retreatment, rec...|[0.0, 26.21009534...|[4.34963607663623...|
|25394466|[lactate, dehydro...|[21.3762879413052...|[-17.550128685500...|
|25394717|[aim, study, inve...|[3.11641169932197...|[-54.981726214632...|    

scala> df6.write.parquet("file:///data/dir/df6")

/* This results in out of memory */
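For reference, the broadcast attempt mentioned above looked roughly like this. It is a sketch against the same spark-shell session (it assumes the sc, myMap, and df3 defined earlier, and only runs inside Spark); combineudfBc is my name for the rewritten UDF:

```scala
// Sketch: ship myMap to each executor once as a broadcast variable,
// instead of serializing a copy into every task closure that captures the UDF.
val bcMap = sc.broadcast(myMap)

val combineudfBc = org.apache.spark.sql.functions.udf { (words: Seq[String]) =>
  val zero = Seq.fill(200)(0.0)
  words
    .map(w => bcMap.value.getOrElse(w, zero))
    .foldLeft(zero)((acc, vec) => acc.zip(vec).map { case (a, b) => a + b })
}

val df6b = df3.withColumn("sum", combineudfBc(df3("filtered")))
```

As noted above, this made no observable difference to memory during the Parquet write.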