Can using functions inside transformations in Scala cause a not-serializable exception?


I have a Breeze DenseMatrix. I compute the mean of every row and put the per-row results into another DenseMatrix, one per column. But I get a Task not serializable exception. I know that sc is not Serializable, but I think the exception is thrown because I call functions from inside the transformations in SafeZones.

Am I right? And if so, how could this be done without any functions? Any help would be great.

Code:

Exception:

org.apache.spark.SparkException: Task not serializable
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
        at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
        at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
        at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.RDD.map(RDD.scala:369)
        at ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1.apply(MotitorDetection.scala:85)
        at ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1.apply(MotitorDetection.scala:82)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
        at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
        at scala.util.Try$.apply(Try.scala:192)
        at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:257)
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:256)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
        - object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@6eee7027)
        - field (class: ScalaApps.MotitorDetection$MonDetect, name: sc, type: class org.apache.spark.SparkContext)
        - object (class ScalaApps.MotitorDetection$MonDetect, MonDetect())
        - field (class: ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1, name: $outer, type: class ScalaApps.MotitorDetection$MonDetect)
        - object (class ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1, <function2>)
        - field (class: ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1$$anonfun$2, name: $outer, type: class ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1)
        - object (class ScalaApps.MotitorDetection$MonDetect$$anonfun$SafeZones$1$$anonfun$2, <function1>)
        at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
        ... 28 more
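
The serialization stack above shows the actual capture chain: the map closure keeps a $outer reference to the MonDetect instance, and MonDetect's sc field is a SparkContext. A minimal sketch of this failure mode, with hypothetical names rather than the original code, assuming Spark's default Java closure serializer:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: a serializable class that keeps a SparkContext in a field and
// calls one of its own methods inside a transformation, like MonDetect.
class MonState(val sc: SparkContext) extends Serializable {
  def addOne(x: Int): Int = x + 1 // instance method

  def run(): Unit = {
    val rdd = sc.parallelize(1 to 10)
    // map(addOne) expands to map(x => this.addOne(x)): the closure
    // captures `this`, serializing `this` drags in the sc field, and
    // Spark fails with NotSerializableException: SparkContext.
    rdd.map(addOne).collect()
  }
}

object Repro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("repro")
    new MonState(new SparkContext(conf)).run()
  }
}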

The findMean method is a method of the MotitorDetection object. The MotitorDetection object has an on-board SparkContext, which is not serializable, so the task used in rdd.map is not serializable either.

Move all the matrix-related functions into a separate, serializable object, MatrixUtils, e.g.:

import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV, _}
import breeze.stats.mean

object MatrixUtils {
  // Mean of every row: a(*, ::) broadcasts `mean` across the rows.
  def findMean(a: BDM[Double]): BDV[Double] = {
    mean(a(*, ::))
  }

  // Pack two length-C vectors into a C x 2 matrix, one per column.
  def toMatrix(x: BDV[Double], y: BDV[Double], C: Int): BDM[Double] = {
    val m = BDM.zeros[Double](C, 2)
    m(::, 0) := x
    m(::, 1) := y
    m
  }

  ...
}
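
As a quick sanity check of these helpers outside Spark (the example matrix and values are illustrative only):

import breeze.linalg.{DenseMatrix => BDM}

object MatrixUtilsDemo {
  def main(args: Array[String]): Unit = {
    val a = BDM((1.0, 2.0, 3.0),
                (4.0, 5.0, 6.0))
    val rowMeans = MatrixUtils.findMean(a) // DenseVector(2.0, 5.0)
    // Pack the means twice just to exercise toMatrix: a 2 x 2 result.
    println(MatrixUtils.toMatrix(rowMeans, rowMeans, 2))
  }
}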
and then only use those methods from within rdd.map(...):

object MotitorDetection {
  val sc = ...

  def SafeZones(stream: DStream[(Int, BDM[Double])]){
    import MatrixUtils._

    ... = rdd.map( ... )

  }
}

Comments:

"It does not work. But I wonder: would doing the same computation with only transformations also cause the exception?"

"@mkey Did you also get rid of the counters, and make sure none of those objects are enclosed in some (synthetically generated) object (the REPL?)?"
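
On the question raised in the comments, whether doing the same computation with only transformations would still fail: yes, any lambda that touches a member of a non-serializable enclosing instance fails the same way, because the closure must drag the instance along. When a helper cannot be moved into a separate object, the usual workaround is to copy whatever the closure needs into a local val first. A hedged sketch with illustrative names:

import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV, _}
import breeze.stats.mean
import org.apache.spark.streaming.dstream.DStream

object SafeZonesLocal {
  def safeZones(stream: DStream[(Int, BDM[Double])]): Unit = {
    stream.foreachRDD { rdd =>
      // Local function value: the map() closure captures only rowMeans,
      // a serializable Function1, never the enclosing object or its sc.
      val rowMeans: BDM[Double] => BDV[Double] = a => mean(a(*, ::))
      val means = rdd.map { case (k, m) => (k, rowMeans(m)) }
      means.foreach { case (k, v) => println(s"$k -> $v") } // illustrative action
    }
  }
}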