Task not serializable when using a custom DataFrame class in Spark Scala
I have a strange problem with Scala/Spark (1.5) and Zeppelin: if I run the following Scala/Spark code, it runs fine:
// TEST NO PROBLEM SERIALIZATION
val rdd = sc.parallelize(Seq(1, 2, 3))
val testList = List[String]("a", "b")
rdd.map { a =>
  val aa = testList(0)
  None
}
However, after declaring a custom DataFrame type as suggested and using it as in the following example:
//READ ALL THE FILES INTO different DF and save into map
import ExtraDataFrameOperations._
val filename = "myInput.csv"
val delimiter = ","
val colToIgnore = Seq("c_9", "c_10")
val inputICFfolder = "hdfs:///group/project/TestSpark/"
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // use the first line of each file as the header
  .option("inferSchema", "false") // do not infer data types: all DFs are merged later (with potential null values), so keep strings only
  .option("delimiter", delimiter)
  .option("charset", "UTF-8")
  .load(inputICFfolder + filename)
  .drop(colToIgnore) // calls the custom DataFrame operation
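(The definition of ExtraDataFrameOperations is not shown in the question; it presumably looks roughly like the sketch below, where the wrapper class name and the foldLeft-based multi-column drop are assumptions inferred from the call .drop(colToIgnore).)

import org.apache.spark.sql.DataFrame

// Assumed shape of the custom extension imported above (not the asker's actual code).
object ExtraDataFrameOperations {
  implicit class DFWithExtraOperations(df: DataFrame) {
    // DataFrame.drop in Spark 1.5 only accepts a single column name,
    // so drop the ignored columns one by one.
    def drop(columnsToDrop: Seq[String]): DataFrame =
      columnsToDrop.foldLeft(df)((tmpDf, colName) => tmpDf.drop(colName))
  }
}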
This runs successfully.
Now, if I run the following code again (the same as above), I get this error message:
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:32
testList: List[String] = List(a, b)
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:314)
    ...
Caused by: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$
Serialization stack:
    - object not serializable (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$,
      value: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$@6c7e70e)
    - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: ExtraDataFrameOperations$module,
      type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$)
    - object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC@4c6d0802)
    - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
    ...
I do not understand:
- why this error occurs although no operation is performed on the DataFrame
- why "ExtraDataFrameOperations" is not serializable when it was used successfully just before
Looks like Spark tries to serialize the whole scope around testList. Try inlining the data:

@inline val testList = List[String]("a", "b")

or use a different object to store the functions/data passed to the driver (see the sketch below).
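For illustration, the second suggestion might look roughly like this; the holder object name and the idea of pulling the list out of the notebook scope are assumptions, not part of the original answer:

// A minimal sketch: keep the data in a small serializable holder object
// (ideally compiled into a jar rather than defined in a notebook cell),
// so the closure only captures this object instead of the whole REPL scope.
object ClosureData extends Serializable {
  val testList: List[String] = List("a", "b")
}

rdd.map { a =>
  val aa = ClosureData.testList(0)
  None
}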
Just add "extends Serializable". This is what works for me:
import java.util.concurrent.atomic.AtomicReference

import org.apache.avro.generic.GenericRecord
import org.apache.kafka.clients.producer.{Callback, ProducerRecord, RecordMetadata}
import org.apache.spark.sql.Dataset

/**
 * A wrapper around a Dataset of ProducerRecords that allows saving it to Kafka.
 *
 * The KafkaProducer is shared by all threads in one executor.
 * Error handling strategy: remember the "last" seen exception and rethrow it to allow the task to fail.
 */
implicit class DatasetKafkaSink(ds: Dataset[ProducerRecord[String, GenericRecord]]) extends Serializable {

  class ExceptionRegisteringCallback extends Callback {
    private[this] val lastRegisteredException = new AtomicReference[Option[Exception]](None)

    override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = {
      Option(exception) match {
        case a @ Some(_) => lastRegisteredException.set(a) // (re)-register exception if the send failed
        case _           => // do nothing on a successful send
      }
    }

    def rethrowException(): Unit = lastRegisteredException.getAndSet(None).foreach(e => throw e)
  }

  /**
   * Save to Kafka, reusing the KafkaProducer from a singleton holder.
   * Returns only once all records have actually been sent to Kafka; in case of error it rethrows
   * the "last" seen exception in the same thread to allow the Spark task to fail.
   */
  def saveToKafka(kafkaProducerConfigs: Map[String, AnyRef]): Unit = {
    ds.foreachPartition { records =>
      val callback = new ExceptionRegisteringCallback
      // KafkaProducerHolder is the answer author's own singleton wrapper (not shown here)
      val producer = KafkaProducerHolder.getInstance(kafkaProducerConfigs)
      records.foreach(record => producer.send(record, callback))
      producer.flush()
      callback.rethrowException()
    }
  }
}
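Applied back to the question, the same idea would presumably mean making the extension object itself serializable (again a sketch with assumed names, since the asker's actual definition is not shown):

import org.apache.spark.sql.DataFrame

// Sketch: mark both the holder object and the implicit wrapper as Serializable
// so the ExtraDataFrameOperations$ module captured by the closure can be serialized.
object ExtraDataFrameOperations extends Serializable {
  implicit class DFWithExtraOperations(df: DataFrame) extends Serializable {
    def drop(columnsToDrop: Seq[String]): DataFrame =
      columnsToDrop.foldLeft(df)((tmpDf, colName) => tmpDf.drop(colName))
  }
}

The $iwC$... wrappers in the stack trace come from the Zeppelin/spark-shell REPL wrapping every line in nested objects, so moving the extension into a compiled jar instead of a notebook cell also tends to avoid the problem.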
Unfortunately @inline did not help, and storing the functions/data in another object does not really fit the strategy behind a custom DataFrame object.