
Apache Spark: the effect of caching on the optimized logical plan


I came across this question and its excellent answer.

The gist is:

val df = spark.range(100)
df.join(df, Seq("id")).filter('id < 20).explain(true)
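As an aside, if you only want the optimizer's output rather than the full explain listing, it can be pulled from the Dataset directly. A minimal sketch, assuming the usual spark-shell session (`spark` with its implicits already imported) and the `df` above:

val q = df.join(df, Seq("id")).filter('id < 20)
// optimizedPlan is the Catalyst logical plan after the optimizer has run
println(q.queryExecution.optimizedPlan.numberedTreeString)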
I think you have run into a bug in your experiment here.

If you run the following in a fresh spark-shell:

val df = spark.range(100)
df.join(df, Seq("id")).filter('id < 20).cache.explain(true)
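Its output appears among the experiments at the end, with the filter pushed below the join. The same experiment can also be reproduced outside the shell; a minimal self-contained sketch (the object name, app name, and local master are illustrative choices, not from the original):

import org.apache.spark.sql.SparkSession

object CachePlanExperiment {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-plan-experiment")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = spark.range(100)

    // In a fresh session the filter is pushed below the join
    // before the plan is cached:
    df.join(df, Seq("id")).filter('id < 20).cache().explain(true)

    spark.stop()
  }
}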
From the comments on the question and answer:

"Hi! Which version of Spark are you using? I get a different optimized logical plan on my local spark-shell, with Spark 2.4.4."

"2.4.5, on Databricks. Will try it on 3 later. @BlueSheepToken"

"I answered it, hope it all makes sense!"

"On Spark 3 this is an interesting experiment (done), but the pushed-down predicate adds noise to the reading. To me it looks like a bug. Is that your conclusion as well?"

"Must be. Cheers."

"I restarted the cluster, and it had definitely gone wrong. You should try simply unpersisting:
df.join(df, Seq("id")).unpersist()
and then
df.join(df, Seq("id")).filter('id < 20).explain(true)"

"Plus one for unpersisting what was cached earlier. The answer itself seems worth updating, in my view. I think I have one more point to make, it is still in progress, but this is good stuff."
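Spelled out, the unpersist suggestion from the comments looks like this; a minimal sketch, run in the same shell session (spark.catalog.clearCache() is the blunter option that drops every cached relation in the session):

// Remove the cache entry for the join, so the next explain
// is computed against an uncached plan again:
df.join(df, Seq("id")).unpersist()
df.join(df, Seq("id")).filter('id < 20).explain(true)

// Or wipe everything cached in this session:
spark.catalog.clearCache()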
df.join(df, Seq("id")).cache.filter('id <20).explain(true)
== Optimized Logical Plan ==
Filter (id#16L < 20)
+- InMemoryRelation [id#16L], StorageLevel(disk, memory, deserialized, 1 replicas)
      +- *(2) Project [id#16L]
         +- *(2) BroadcastHashJoin [id#16L], [id#21L], Inner, BuildRight
            :- *(2) Range (0, 100, step=1, splits=8)
            +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#112]
               +- *(1) Range (0, 100, step=1, splits=8)
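To confirm that the join itself is what got materialized here, Dataset.storageLevel can be checked; it reports StorageLevel.NONE for uncached data. A sketch, same shell session assumed:

val joined = df.join(df, Seq("id"))
// Reports the cached level, e.g. StorageLevel(disk, memory, deserialized, 1 replicas),
// because the cache manager matches this plan against the earlier .cache call:
println(joined.storageLevel)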
Caching the filtered result on top of the already-cached join then stacks a second InMemoryRelation over the first:

df.join(df, Seq("id")).filter('id < 20).cache.explain(true)
== Optimized Logical Plan ==
InMemoryRelation [id#16L], StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *(1) Filter (id#16L < 20)
      +- *(1) InMemoryTableScan [id#16L], [(id#16L < 20)]
            +- InMemoryRelation [id#16L], StorageLevel(disk, memory, deserialized, 1 replicas)
                  +- *(2) Project [id#16L]
                     +- *(2) BroadcastHashJoin [id#16L], [id#21L], Inner, BuildRight
                        :- *(2) Range (0, 100, step=1, splits=8)
                        +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#112]
                           +- *(1) Range (0, 100, step=1, splits=8)
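The nesting happens because the cache manager matches canonicalized plan fragments, not variable names: a second, separately built copy of the same join resolves to the same cache entry. A sketch of that equivalence (sameResult lives in Catalyst internals, but is reachable from the shell):

val a = df.join(df, Seq("id")).queryExecution.analyzed
val b = df.join(df, Seq("id")).queryExecution.analyzed
println(a.sameResult(b))   // true, so both map to one InMemoryRelation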
In a fresh spark-shell, by contrast, the filter is pushed below the join before the result is cached:

val df = spark.range(100)
df.join(df, Seq("id")).filter('id < 20).cache.explain(true)
== Optimized Logical Plan ==
InMemoryRelation [id#0L], StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *(2) Project [id#0L]
      +- *(2) BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight
         :- *(2) Filter (id#0L < 20)
         :  +- *(2) Range (0, 100, step=1, splits=12)
         +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
            +- *(1) Filter (id#2L < 20)
               +- *(1) Range (0, 100, step=1, splits=12)
Whereas if the join is cached first, again in a fresh shell, the pushdown is blocked and the second statement produces the nested InMemoryRelation:

val df = spark.range(100)
df.join(df, Seq("id")).cache.filter('id < 20).explain(true)
df.join(df, Seq("id")).filter('id < 20).cache.explain(true)
== Optimized Logical Plan ==
InMemoryRelation [id#0L], StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *(1) Filter (id#0L < 20)
      +- *(1) InMemoryTableScan [id#0L], [(id#0L < 20)]
            +- InMemoryRelation [id#0L], StorageLevel(disk, memory, deserialized, 1 replicas)
                  +- *(2) Project [id#0L]
                     +- *(2) BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight
                        :- *(2) Range (0, 100, step=1, splits=12)
                        +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
                           +- *(1) Range (0, 100, step=1, splits=12)
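One last observation from the plans above: when the filter cannot cross the InMemoryRelation it is not lost; it shows up as a predicate of the InMemoryTableScan reading the cached data, as in the [(id#0L < 20)] entry. The physical plan shows the same thing; a sketch, same session assumed:

println(df.join(df, Seq("id")).filter('id < 20).queryExecution.executedPlan)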