Apache Spark: Why does a "not serializable" exception occur when changing an RDD to a DataFrame?

Tags: apache-spark, apache-spark-sql, spark-structured-streaming

I am using Structured Streaming, and the following code works:

val j = new Jedis() // a Redis client, which is not serializable

xx.writeStream.foreachBatch{(batchDF: DataFrame, batchId: Long) => {
  j.xtrim(...)... // call Jedis methods here
  batchDF.rdd.mapPartitions(...)
}}

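But the DataFrame version of the code throws an exception:

object not serializable (class: redis.clients.jedis.Jedis, value: redis.clients.jedis.Jedis@a8e0378)

The code changes in only one place: batchDF.rdd.mapPartitions (RDD) becomes batchDF.mapPartitions (DataFrame).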

My Jedis code should execute on the driver and never reach the executors. I assumed the Spark RDD and DataFrame APIs would behave similarly, so why does this happen?

I used Ctrl+click to jump into the lower-level code.

batchDF.mapPartitions

goes to

  @Experimental
  @InterfaceStability.Evolving
  def mapPartitions[U : Encoder](func: Iterator[T] => Iterator[U]): Dataset[U] = 
  {
    new Dataset[U](
      sparkSession,
      MapPartitions[T, U](func, logicalPlan),
      implicitly[Encoder[U]])
  }

and

batchDF.rdd.mapPartitions

goes to

    def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }

My Spark version is 2.4.3.

Below is the simplest version of my code. While reducing it, I found something else:

val j = new Jedis() // a Redis client, which is not serializable

xx.writeStream.foreachBatch{(batchDF: DataFrame, batchId: Long) => {
  j.xtrim(...)... // call Jedis methods here
  batchDF.mapPartitions(x => {
    val arr = x.grouped(2).toArray // this line matters
  })
  // the only change is batchDF.rdd -> batchDF
}}
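To make the error message itself concrete, here is a minimal sketch that needs no Spark at all. It assumes the exception comes from ordinary Java serialization of a function object that, directly or indirectly, holds a reference to the client. FakeClient, ClosureCaptureDemo, and isSerializable are hypothetical names used only for illustration; they are not part of Spark or Jedis.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a non-serializable client such as Jedis.
class FakeClient { def ping(): String = "PONG" }

object ClosureCaptureDemo {
  // Returns true if Java serialization accepts the object, false if it rejects it.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      try out.writeObject(obj) finally out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val client = new FakeClient

    // Captures nothing from the enclosing scope.
    val pure: Iterator[Int] => Iterator[Int] = it => it.map(_ + 1)

    // Captures `client`, so serializing the function also tries to serialize the client.
    val capturing: Iterator[Int] => Iterator[String] = it => it.map(_ => client.ping())

    println(isSerializable(pure))      // true
    println(isSerializable(capturing)) // false
  }
}
```

Whether a given closure ends up holding such a reference depends on the Scala compiler and on whether Spark cleans the closure before serializing it, which is one plausible place for the two mapPartitions variants to differ.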

Look at the source: internally, it calls rdd.mapPartitions with your function.

  /**
   * Returns a new RDD by applying a function to each partition of this DataFrame.
   * @group rdd
   * @since 1.3.0
   */
  def mapPartitions[R: ClassTag](f: Iterator[Row] => Iterator[R]): RDD[R] = {
    rdd.mapPartitions(f)
  }

So there is no difference between the two; the mistake is probably somewhere else in your code.

Ideally, it should look like this:

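batchDF.mapPartitions { partition =>
  // Better to create a JedisPool and borrow a connection than to call new Jedis each time.
  val j = new Jedis()
  // Iterator.map is lazy, so materialize the results before closing the connection.
  val result = partition.map { row =>
    row.mkString(",") // placeholder: do some processing here
  }.toList
  j.close() // release the connection and any other resources here
  result.iterator
}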
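The per-partition pattern above can be sketched without Spark or Redis. The example below assumes a hypothetical FakeConnection in place of Jedis and a plain function in place of the mapPartitions callback. It illustrates why the results are forced before the connection is closed: Iterator.map is lazy, so closing first would leave the caller with an iterator that uses a closed connection.

```scala
// Hypothetical stand-in for a per-partition connection such as Jedis.
class FakeConnection {
  private var open = true
  def lookup(key: Int): String = {
    require(open, "connection already closed")
    s"v$key"
  }
  def close(): Unit = { open = false }
}

object PartitionResourceDemo {
  // Same shape as a mapPartitions function: one connection per partition.
  def processPartition(rows: Iterator[Int]): Iterator[String] = {
    val conn = new FakeConnection
    try {
      // toList forces every lookup while the connection is still open.
      rows.map(k => conn.lookup(k)).toList.iterator
    } finally {
      conn.close()
    }
  }

  def main(args: Array[String]): Unit =
    println(processPartition(Iterator(1, 2, 3)).toList) // List(v1, v2, v3)
}
```

Materializing a whole partition costs memory, so for large partitions a wrapper iterator that closes the connection when exhausted may be preferable. That variant is not shown here.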

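Comments:

The Dataset version calls val deserialized = CatalystSerde.deserialize[T](child), whereas the RDD version calls ClosureCleaner.clean(f, checkSerializable), so there may be some differences downstream.

@morsik I think your answer should be correct.

Can you provide a full stack trace and/or a reproducible example? I tried to come up with a simplified version but got no exception. Are you 100% sure you are not using the Jedis object inside mapPartitions?

@morsik I posted the simplest version of the code. In the process, I found that what really matters here is val arr = x.grouped(2).toArray. This code works in batchDF.rdd.mapPartitions but not in batchDF.mapPartitions. Create any Spark session and add this code, and I think you can reproduce the error. I believe this code does nothing with the Jedis object.

@morsik Changing toArray to toString produces the same error. I think the question could be rephrased as "Why is an operation inside a DataFrame's mapPartitions tied to driver code?" I want to use val j = new Jedis on the Spark driver, not in the executors. Is there any potential risk in my version of the code?

@Ram Ghadiyaram Your reference is for Spark 1.6.

@RamGhadiyaram I posted the code I jumped to. My version is 2.4.3; I don't know what it looks like in the latest version. If your code is from 1.6, the latest code should be closer to mine. By the way, I'm quite sure the only difference between the two versions is batchDF.rdd.mapPartitions versus batchDF.mapPartitions.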