What happens if I cache the same RDD twice in Spark?


java caching apache-spark


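I'm building a generic function that receives an RDD and runs some calculations on it. Since I run more than one calculation on the input RDD, I would like to cache it. For example: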

public JavaRDD<String> foo(JavaRDD<String> r) {
    r.cache();
    JavaRDD<String> t1 = r... // Some calculations
    JavaRDD<String> t2 = r... // Other calculations
    return t1.union(t2);
}


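My question is: since r is given to me, it may or may not already be cached. If it is cached and I call cache on it again, will Spark create a new caching layer, meaning that while t1 and t2 are being calculated there will be two instances of r in the cache? Or will Spark know that r is already cached and ignore the call?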

Nothing. If you call cache on an RDD that is already cached, nothing happens; the RDD will be cached (once). Caching, like many other transformations, is lazy:

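When you call cache, the RDD's storageLevel is set to MEMORY_ONLY.

When you call cache again, it is set to the same value (no change).

Upon evaluation, when the underlying RDD is materialized, Spark checks the RDD's storageLevel and, if it requires caching, caches it.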
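
The three steps above can be illustrated with a toy model in plain Python. This is only a sketch of the behaviour described in this answer, not Spark's actual implementation: ToyRDD, block_store, and load are hypothetical names invented for the illustration, and the storage levels are plain strings.

```python
import itertools

NONE = "NONE"
MEMORY_ONLY = "MEMORY_ONLY"


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark code)."""

    _ids = itertools.count()

    def __init__(self, compute, block_store):
        self.id = next(ToyRDD._ids)      # every toy RDD gets its own id
        self.storage_level = NONE
        self._compute = compute
        self._store = block_store        # dict: rdd id -> materialized data

    def cache(self):
        # Only sets a flag; nothing is computed or stored at this point.
        self.storage_level = MEMORY_ONLY
        return self

    def collect(self):
        # An "action": the flag is consulted only when data is materialized.
        if self.id in self._store:
            return self._store[self.id]
        data = list(self._compute())
        if self.storage_level != NONE:
            self._store[self.id] = data
        return data


block_store = {}
calls = []


def load():
    # Records each real computation so repeated work would be visible.
    calls.append("computed")
    return ["a", "b", "c"]


r = ToyRDD(load, block_store)
same = r.cache().cache()   # the second call sets the same level again
r.collect()                # first action computes and stores the data
r.collect()                # second action is served from the store
print(same is r, len(block_store), calls)
```

In this model the second cache call changes nothing, and the store ends up with a single entry keyed by the RDD's id, which matches the behaviour reported in the answers here.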

So you're safe.
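
I tested this on my cluster, and Zohar is right: nothing happens, and the RDD is cached only once. I think the reason is that every RDD has an internal id, and Spark uses that id to mark whether the RDD has already been cached. So caching the same RDD multiple times has no additional effect.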
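Update [code added as requested]. Here is the code I used:

### cache and count, then check the storage info in the web UI
raw_file = sc.wholeTextFiles('hdfs://10.21.208.21:8020/user/mercury/names', minPartitions=40)\
    .setName("raw_file")\
    .cache()
raw_file.count()

### cache and count again, then look at the web UI: nothing changes
raw_file.cache()
raw_file.count()

### change the RDD's name, then cache and count again, to see whether it gets cached again under the new name.
### Still nothing changes, so I think it may be using the RDD id as the marker; to know more we would need
### to read the documentation, or even the source code, in detail
raw_file.setName("raw_file_2")
raw_file.cache().count()

Comments:

About your note: that is something I have wondered about for a while and found no documentation for. If your answer is correct and calling cache only changes a flag inside the RDD object, then why can't I just keep using the same object? To explain my question a bit more: suppose there is an RDD called orig, and someone outside the function called r = orig.cache(), and then inside the function I call cached = r.cache(). If what you say is true, I will have the same data stored in the cache twice, once as r and once as cached, won't I?

You're right, I was wrong. There is no need to use the return value of cache; it returns this, the exact same RDD.

rdd1.cache(); rdd1 = rdd1.map(...); rdd1.cache(); rdd1.count(); Will this cache only once, or will it overwrite the previous cache because a transformation was applied to the same rdd?

Thanks. Do you know whether what @TzachZohar says about cache() is correct? If so, do you need to write raw_file = raw_file.cache()?

@RoeeGavirel cache is just a method of the RDD, so there is no need to reassign its result.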
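
For the original scenario, where a function receives an RDD that may or may not already be cached, one possible pattern is to record whether this particular call was the one that marked it. The sketch below is an illustration only: ensure_cached is a hypothetical helper, and it assumes the RDD exposes an is_cached attribute and a cache() method, as PySpark's RDD does. FakeRDD is a stub that exists only so the example runs without Spark.

```python
def ensure_cached(rdd):
    """Mark rdd for caching unless it already is.

    Returns True if this call did the marking, False otherwise.
    Assumes rdd has an `is_cached` attribute and a `cache()` method.
    """
    if rdd.is_cached:
        return False
    rdd.cache()
    return True


class FakeRDD:
    """Hypothetical stub with only the members the helper touches."""

    def __init__(self):
        self.is_cached = False
        self.cache_calls = 0

    def cache(self):
        self.is_cached = True
        self.cache_calls += 1
        return self   # returns the same object, as discussed in the comments


r = FakeRDD()
first = ensure_cached(r)    # not cached yet, so this call marks it
second = ensure_cached(r)   # already marked, so nothing is done
print(first, second, r.cache_calls)
```

The returned flag lets the caller decide later whether it is responsible for unpersisting the RDD. Because caching is lazy, any such cleanup would need to wait until the results that depend on the RDD have actually been materialized.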