Why is rdd.map(identity).cache slow when the RDD items are big?

performance caching apache-spark

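I found that using .map(identity).cache on an RDD becomes very slow when the items are big, whereas it is almost instantaneous otherwise.

Note: this is probably related to another question, but here I provide a very precise example (it can be executed directly in spark-shell):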

// simple function to profile execution time (in ms)
def profile[R](code: => R): R = {
  val t = System.nanoTime
  val out = code
  println(s"time = ${(System.nanoTime - t) / 1000000}ms")
  out
}

// create some big-size items
def bigContent() = (1 to 1000).map(i => (1 to 1000).map(j => (i, j)).toMap)

// create the rdd
val n = 1000 // size of the rdd
val rdd = sc.parallelize(1 to n).map(k => bigContent()).cache
rdd.count // to trigger caching

// profiling
profile(rdd.count)                     // around 12 ms
profile(rdd.map(identity).count)       // same
profile(rdd.cache.count)               // same
profile(rdd.map(identity).cache.count) // 5700 ms !!!

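At first I assumed this was the time needed to create a new RDD (the container). But if I use an RDD of the same size with small content, there is only a tiny difference in execution time:

val rdd = sc.parallelize(1 to n).cache
rdd.count

profile(rdd.count)                     // around 9 ms
profile(rdd.map(identity).count)       // same
profile(rdd.cache.count)               // same
profile(rdd.map(identity).cache.count) // 15 ms

So it looks like caching is actually copying the data. I thought it might also be losing time serializing it, but I checked that cache uses the default MEMORY_ONLY persistence:

import org.apache.spark.storage.StorageLevel
rdd.getStorageLevel == StorageLevel.MEMORY_ONLY // true

=> So, is caching copying the data, or is it something else?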

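This is a major limitation for my application, because I started with a design that uses something like rdd = rdd.map(f: Item => Item).cache, which can be used with many such functions applied in an arbitrary order (an order I cannot determine beforehand).

I am using Spark 1.6.0.

Edit

When I look at the Spark UI -> Stages tab -> last stage (i.e. 4), all tasks have almost the same data:

duration = 3s (it went down to 3s, but that is still about 2.9s too much :-\)
scheduler delay 10ms
task deserialization 20ms
gc 0.1s (all tasks have it, but why would GC be triggered??)
result serialization 0ms
getting result 0ms
peak execution memory 0.0B
input size 7.0MB / 125
no errors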
A jstack of the process running org.apache.spark.executor.CoarseGrainedExecutorBackend during the slow caching shows the following:
"Executor task launch worker-4" #76 daemon prio=5 os_prio=0 tid=0x00000000030a4800 nid=0xdfb runnable [0x00007fa5f28dd000]
   java.lang.Thread.State: RUNNABLE
  at java.util.IdentityHashMap.resize(IdentityHashMap.java:481)
  at java.util.IdentityHashMap.put(IdentityHashMap.java:440)
  at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:176)
  at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:251)
  at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:211)
  at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:203)
  at org.apache.spark.util.SizeEstimator$$anonfun$sampleArray$1.apply$mcVI$sp(SizeEstimator.scala:284)
  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
  at org.apache.spark.util.SizeEstimator$.sampleArray(SizeEstimator.scala:276)
  at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:260)
  at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:211)
  at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:203)
  at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:70)
  at org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
  at org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
  at org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
  at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
  at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
  at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
  at org.apache.spark.scheduler.Task.run(Task.scala:89)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)


"Executor task launch worker-5" #77 daemon prio=5 os_prio=0 tid=0x00007fa6218a9800 nid=0xdfc runnable [0x00007fa5f34e7000]
   java.lang.Thread.State: RUNNABLE
  at java.util.IdentityHashMap.put(IdentityHashMap.java:428)
  at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:176)
  at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:224)
  at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:223)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:223)
  at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:203)
  at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:70)
  at org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
  at org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
  at org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
  at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
  at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
  at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
  at org.apache.spark.scheduler.Task.run(Task.scala:89)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

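The SizeEstimator makes sense as one of the main costs of caching something that is ostensibly already in memory, since proper size estimation for unknown objects can be fairly difficult. If you look at the visitSingleObject method, you can see that it relies heavily on reflection, calling getClassInfo to access runtime type information. Not only is the full object hierarchy traversed, but each nested member is also checked against an IdentityHashMap to detect which references point to the same concrete object instance, which is why the stack traces show so much time in those IdentityHashMap operations.
For your example objects, each item is basically a list of maps from wrapped integers to wrapped integers; presumably Scala's internal map implementation also holds an array, which explains the visitSingleObject -> List.foreach -> visitSingleObject -> visitSingleObject call hierarchy. In any case, there are a lot of inner objects to visit here, and the SizeEstimator sets up a fresh IdentityHashMap for each object it samples.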
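To make the cost described above more tangible, here is a toy sketch of an identity-based object-graph walk. It is not Spark's SizeEstimator: the helper name countReachable is hypothetical, it only follows object arrays, Scala collections and tuples/case classes (no reflection), and it counts objects rather than estimating bytes. It only illustrates why a structure made of many small wrapper objects is far more expensive to walk than arrays of primitives.

```scala
import java.util.IdentityHashMap
import scala.collection.mutable.ArrayBuffer

// Toy illustration (hypothetical helper, not Spark code): counts the distinct
// objects reachable from `root`, using reference identity to skip objects
// that have already been seen.
def countReachable(root: AnyRef): Int = {
  val visited = new IdentityHashMap[AnyRef, AnyRef]()
  val pending = ArrayBuffer.empty[AnyRef]

  // Register an object once, by identity, and schedule it for a visit.
  def enqueue(x: Any): Unit = x match {
    case ref: AnyRef if !visited.containsKey(ref) =>
      visited.put(ref, ref)
      pending += ref
    case _ => ()
  }

  enqueue(root)
  while (pending.nonEmpty) {
    pending.remove(pending.length - 1) match {
      case arr: Array[AnyRef] => arr.foreach(enqueue)                 // object arrays: visit every element
      case it: Iterable[_]    => it.foreach(enqueue)                  // Scala collections
      case p: Product         => p.productIterator.foreach(enqueue)   // tuples, case classes
      case _                  => ()                                   // leaves, including primitive arrays
    }
  }
  visited.size
}
```

With this sketch, an Array[Array[Int]] of 1000 rows counts as 1001 objects (the outer array plus one leaf per row), whereas nested collections of tuples contribute an object per collection, per tuple and per boxed value.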
In the case where you measure:
profile( rdd.cache.count )

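That measurement does not exercise the caching logic, because the RDD has already been cached successfully and Spark is smart enough not to re-run it. You can isolate the exact cost of the caching logic, independently of the extra map(identity) transformation, by profiling the creation and caching of a fresh RDD directly. Here is my Spark session, continuing from your last few lines: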
scala> profile( rdd.count )
time = 91ms
res1: Long = 1000

scala> profile( rdd.map(identity).count )
time = 112ms
res2: Long = 1000

scala> profile( rdd.cache.count )
time = 59ms
res3: Long = 1000

scala> profile( rdd.map(identity).cache.count )
time = 6564ms                                                                   
res4: Long = 1000

scala> profile( sc.parallelize(1 to n).map( k => bigContent() ).count )
time = 14990ms                                                                  
res5: Long = 1000

scala> profile( sc.parallelize(1 to n).map( k => bigContent() ).cache.count )
time = 22229ms                                                                  
res6: Long = 1000

scala> profile( sc.parallelize(1 to n).map( k => bigContent() ).map(identity).cache.count )
time = 21922ms                                                                  
res7: Long = 1000

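So you can see that the slowness does not come from running a map transformation per se. Rather, ~6s appears to be the fundamental cost of running the caching logic for 1000 objects when each object has roughly 1,000,000 to 10,000,000 inner objects (depending on how the Map implementation is laid out; for example, the extra visitArray nesting in the top stack trace hints that the HashMap implementation has nested arrays, which makes sense for a typical dense linear-probing data structure inside each hashtable entry).
For your concrete use case, you should prefer lazy caching where possible, because caching intermediate results carries an overhead that is not a good tradeoff unless you really reuse those intermediate results for many separate downstream transformations. If you do branch one RDD out into several different downstream transformations, you may well need the caching step, provided the original transformations are expensive.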
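One way to read the advice above, as a minimal sketch: collect the transformations in whatever order they turn out to be needed, fuse them into a single function, and cache only at the point where the result is actually reused. The Item type and the two steps below are hypothetical stand-ins for the question's Item => Item functions, and the Spark calls are shown only as comments.

```scala
// Hypothetical item type standing in for the question's Item.
final case class Item(value: Int)

// Hypothetical transformations, collected in an order known only at runtime.
val steps: Seq[Item => Item] = Seq(
  item => Item(item.value + 1),
  item => Item(item.value * 2)
)

// Fuse all steps into one function, applied left to right.
val composed: Item => Item = Function.chain(steps)

// With Spark, this would be used roughly as follows (not executed here):
//   val result = rdd.map(composed)   // no cache between the individual steps
//   result.cache()                   // only if `result` feeds several downstream jobs
```

Spark already pipelines consecutive map calls lazily, so simply dropping the intermediate .cache calls gives the same effect; the explicit composition just makes the single caching point obvious.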
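The workaround is to use inner data structures that are more amenable to constant-time size calculations (for example, arrays of primitives), which saves a lot of the cost of iterating over huge numbers of wrapper objects and of relying on reflection for them in SizeEstimator.
I tried things like Array[Array[Int]], and although there is still nonzero overhead, it is 10x better for a similar data size: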
scala> def bigContent2() = (1 to 1000).map( i => (1 to 1000).toArray ).toArray
bigContent2: ()Array[Array[Int]]

scala> val rdd = sc.parallelize(1 to n).map( k => bigContent2() ).cache
rdd: org.apache.spark.rdd.RDD[Array[Array[Int]]] = MapPartitionsRDD[23] at map at <console>:28

scala> rdd.count // to trigger caching
res16: Long = 1000                                                              

scala> 

scala> // profiling

scala> profile( rdd.count )
time = 29ms
res17: Long = 1000

scala> profile( rdd.map(identity).count )
time = 42ms
res18: Long = 1000

scala> profile( rdd.cache.count )
time = 34ms
res19: Long = 1000

scala> profile( rdd.map(identity).cache.count )
time = 763ms                                                                    
res20: Long = 1000

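As discussed in the comments section, you can also consider using the MEMORY_ONLY_SER StorageLevel, which can quite possibly be a good fit as long as there is an efficient serializer.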
scala> def bigContent3() = (1 to 1000).map( i => (1 to 1000).toArray )
bigContent3: ()scala.collection.immutable.IndexedSeq[Array[Int]]

scala> val rdd = sc.parallelize(1 to n).map( k => bigContent3() ).cache
rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.IndexedSeq[Array[Int]]] = MapPartitionsRDD[27] at map at <console>:28

scala> rdd.count // to trigger caching
res21: Long = 1000                                                              

scala> 

scala> // profiling

scala> profile( rdd.count )
time = 27ms
res22: Long = 1000

scala> profile( rdd.map(identity).count )
time = 39ms
res23: Long = 1000

scala> profile( rdd.cache.count )
time = 37ms
res24: Long = 1000

scala> profile( rdd.map(identity).cache.count )
time = 2781ms                                                                   
res25: Long = 1000

import org.apache.spark.storage.StorageLevel
profile( rdd.map(identity).persist(StorageLevel.MEMORY_ONLY_SER).count )

scala> profile( rdd.map(identity).persist(StorageLevel.MEMORY_ONLY_SER).count )
time = 6709ms                                                                   
res19: Long = 1000

scala> profile( rdd.map(identity).cache.count )
time = 6126ms                                                                   
res20: Long = 1000

scala> profile( rdd.map(identity).persist(StorageLevel.MEMORY_ONLY).count )
time = 6214ms                                                                   
res21: Long = 1000

scala> profile( rdd.map(identity).persist(StorageLevel.MEMORY_ONLY_SER).count )
time = 500ms
res18: Long = 1000

scala> profile( rdd.map(identity).cache.count )
time = 5353ms                                                                   
res19: Long = 1000

scala> profile( rdd.map(identity).persist(StorageLevel.MEMORY_ONLY).count )
time = 5927ms                                                                   
res20: Long = 1000