Scala Spark 2.1: Out of memory when writing a DataFrame to a Parquet file?
I am trying to write a DataFrame (about 14 million rows) to a local Parquet file, but I keep running out of memory while doing so. I have a large map, val myMap: Map[String, Seq[Double]], and I use it inside a UDF on a very large DataFrame via val newDF = df.withColumn("stuff", udfWithMap). I have 128 GB of RAM available, and even after persisting the DataFrame with DISK_ONLY and running df.show, I still have about 100 GB free. However, when I try df.write.parquet, Spark's memory usage spikes and I run out of memory. I have also tried broadcasting myMap, but that did not seem to make any difference. What is going wrong?
Here is a sample of my code:
scala> type LookupMapSeq = (String, Seq[Double])
scala> val myMap = sc.objectFile[LookupMapSeq]("file:///data/dir/myMap").collectAsMap()
/* myMap.size is about 150,000 and each Seq[Double] has 200 elements */
scala> val combineudf = functions.udf[Seq[Double], Seq[String]] { v1 =>
         val wordVec = v1.map(y => myMap.getOrElse(y, Seq.fill(200)(0.0)))
         wordVec.foldLeft(Seq.fill(200)(0.0)) { case (acc, list) =>
           acc.zipWithIndex.map { case (value, i) => value + list(i) }
         }
       }
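One detail about the fold above: it sums the vectors by positional indexing, list(i). If the stored Seq[Double] values are Lists, that indexing is linear, so each 200-element sum does quadratic work per word. A minimal plain-Scala sketch of the same element-wise summation written with zip instead of indexing (sumVectors is my name, not from the original code):

```scala
// Element-wise sum of fixed-length vectors, equivalent to the
// foldLeft/zipWithIndex version but without positional indexing.
def sumVectors(vecs: Seq[Seq[Double]], dim: Int = 200): Seq[Double] =
  vecs.foldLeft(Seq.fill(dim)(0.0)) { (acc, vec) =>
    acc.zip(vec).map { case (a, b) => a + b }
  }
```

This is only a sketch of the summation itself, not of the surrounding Spark job.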
scala> import org.apache.spark.storage.StorageLevel
scala> val df6 = df3.withColumn("sum", combineudf(df3("filtered"))).persist(StorageLevel.DISK_ONLY)
scala> df6.show
+--------+--------------------+--------------------+--------------------+
| pmid| filtered| TFIDFvec| sum|
+--------+--------------------+--------------------+--------------------+
|25393341|[retreatment, rec...|[0.0, 26.21009534...|[4.34963607663623...|
|25394466|[lactate, dehydro...|[21.3762879413052...|[-17.550128685500...|
|25394717|[aim, study, inve...|[3.11641169932197...|[-54.981726214632...|
scala> df6.write.parquet("file:///data/dir/df6")
/* This results in out of memory */
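For reference, when I say I tried broadcasting myMap, I mean the usual pattern of shipping the map to executors as a read-only broadcast variable. A sketch of that variant (assuming sc is the active SparkContext; bcMap and combineBcUdf are names I made up here):

```scala
import org.apache.spark.sql.functions

// Broadcast the lookup map so each executor holds one read-only copy
// instead of capturing the driver-side map in the UDF closure.
val bcMap = sc.broadcast(myMap)

val combineBcUdf = functions.udf[Seq[Double], Seq[String]] { words =>
  words
    .map(w => bcMap.value.getOrElse(w, Seq.fill(200)(0.0)))
    .foldLeft(Seq.fill(200)(0.0)) { (acc, vec) =>
      acc.zip(vec).map { case (a, b) => a + b }
    }
}

val df7 = df3.withColumn("sum", combineBcUdf(df3("filtered")))
```

As noted above, this did not noticeably change memory usage in my case.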