Scala: How do I use cache() correctly?

I am using Spark 1.1.0 and trying to load a graph into GraphX. Part of my code looks like this:

val distinct = context.union(r1, r2).distinct;
distinct.cache()
val zipped = distinct.zipWithUniqueId
zipped.cache
distinct.unpersist(false)

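When I run it on the cluster, the first stage executed is:

distinct at Test.scala:72

But after that stage completes, I see no entry in the "Storage" tab of the Spark UI. The next stage is:

zipWithUniqueId at Test.scala:78

But after that, it starts again with:

distinct at Test.scala:72

Shouldn't this result have been cached? And is caching an RDD useful at all if it is only used once?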

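Edit: I forgot to mention that I also get fetch failures during the zipWithUniqueId stage.

Possible fix for the fetch problem
This article describes a possible fix for a possible bug in Spark version 1.1.0.
There is also a possible solution from Andrew Ash on the spark-user mailing list:
In 1.1 there currently seem to be 3 things that can cause FetchFailures:
1) Long GCs on an executor (longer than spark.core.connection.ack.wait.timeout, which defaults to 60 seconds)
2) Too many open files (hitting the kernel limit set by ulimit -n)
3) Some undetermined issue being tracked on that ticket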
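
If long GC pauses (cause 1) are the suspect, one thing to try is raising the ack timeout named above. The following is only a minimal sketch, assuming the application builds its own SparkConf. The object name, the application name, and the 600-second value are illustrative assumptions, not recommendations from the mailing-list thread:

```scala
import org.apache.spark.SparkConf

object FetchFailureConfSketch {
  // Builds a SparkConf with a longer ack timeout (in seconds).
  // 600 is an arbitrary illustrative value; tune it to the GC pauses you observe.
  def conf(): SparkConf =
    new SparkConf()
      .setAppName("fetch-failure-conf-sketch") // hypothetical application name
      .set("spark.core.connection.ack.wait.timeout", "600")
}
```

Cause 2 is not a Spark setting: the open-file limit is raised at the operating-system level for the user that runs the executors.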

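Caching

cache is applied the first time the RDD is evaluated, not when cache is called. To be effective, the call to cache must therefore come before the action that materializes an RDD you intend to use more than once.
Because caching takes effect on evaluation, caching inside a linear RDD lineage that is executed only once just takes up memory and brings no benefit.
So, if your pipeline is:
val distinct = context.union(r1, r2).distinct;
val zipped = distinct.zipWithUniqueId
zipped.cache

then using cache between distinct and zipped serves no purpose unless you need to access the distinct data again. Since you unpersist it right afterwards, I assume you do not.
In short, use .cache only if the evaluated RDD will be used more than once (for example in iterative algorithms or lookups).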
Example of caching in the Spark shell:
val rdd = sc.makeRDD(1 to 1000)
val cached = rdd.cache // at this point, nothing shows up in the Spark UI

cached.count // first action: the RDD now shows up as cached in the Spark UI
res0: Long = 1000

val zipped = cached.zipWithUniqueId
val zipcache = zipped.cache // again, nothing new in the UI
zipcache.first // first is an action and triggers evaluation of the RDD

cached.unpersist(blocking=true) // force immediate unpersist
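
Applied to the pipeline from the question, the same rule suggests calling unpersist only after every job that reads the cached RDD has run. In the question, unpersist is called right after cache, which would explain why nothing shows up in the Storage tab. Below is a minimal sketch, assuming a local master and two small stand-in RDDs in place of r1 and r2; the object name, method name, and data are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheReuseSketch {
  // Returns (number of distinct elements, number of zipped pairs).
  def run(sc: SparkContext): (Long, Long) = {
    // Hypothetical stand-ins for the r1 and r2 RDDs in the question.
    val r1 = sc.parallelize(Seq(1, 2, 3, 4))
    val r2 = sc.parallelize(Seq(3, 4, 5, 6))

    val distinct = sc.union(r1, r2).distinct()
    distinct.cache() // only marks the RDD; nothing is computed or stored yet

    // First action: evaluates the lineage and stores the partitions.
    val distinctCount = distinct.count()

    // Second use: this job can read the cached partitions
    // instead of re-running union and distinct.
    val zippedCount = distinct.zipWithUniqueId().count()

    // Release the cache only after every job that needs it has run.
    distinct.unpersist(blocking = true)
    (distinctCount, zippedCount)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cache-reuse-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

If distinct were consumed by a single job only, the cache and unpersist calls could simply be dropped, as the answer above argues.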