Multithreading: running threads in Spark DataFrame foreachPartition()
I use multiple threads inside foreachPartition(), which works well for me except when the underlying iterator is a TungstenAggregationIterator. Here is a minimal snippet to reproduce the problem:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object Reproduce extends App {
  val sc = new SparkContext("local", "reproduce")
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  // groupBy(...).count() makes the partition iterator a TungstenAggregationIterator
  val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()

  df.foreachPartition { iterator =>
    // Consume the iterator on a thread from the global pool, not on the task thread
    val f = Future(iterator.toVector)
    Await.result(f, Duration.Inf)
  }
}
When I run this, I get:
java.lang.NullPointerException
    at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
    at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
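I believe I understand why this happens: TungstenAggregationIterator uses a ThreadLocal variable, which returns null when it is read from any thread other than the original thread that obtained the iterator from Spark. From inspecting the code, this does not appear to differ between recent Spark versions. However, as far as I can tell, this limitation is specific to TungstenAggregationIterator, and it is not documented.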
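The ThreadLocal behaviour described above can be illustrated without Spark. The following is only a minimal sketch: the `context` variable and the `ThreadLocalDemo` object are hypothetical stand-ins and are not taken from Spark's source. It shows that a value set on one thread is not visible from another.

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

object ThreadLocalDemo {
  // Hypothetical stand-in for per-task state that is kept in a ThreadLocal.
  private val context = new ThreadLocal[String]

  // Reads the ThreadLocal from a separate, freshly created worker thread.
  def readFromOtherThread(): String = {
    val pool = Executors.newSingleThreadExecutor()
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try Await.result(Future(context.get()), Duration.Inf)
    finally pool.shutdown()
  }

  def main(args: Array[String]): Unit = {
    context.set("task-0")          // set on the calling thread only
    println(context.get())         // the calling thread sees its own value: task-0
    println(readFromOtherThread()) // the worker thread never set a value: null
  }
}
```

Any code that dereferences such a value on the worker thread without a null check would fail with a NullPointerException, which matches the symptom in the stack trace.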
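Is there a way to overcome this limitation of TungstenAggregationIterator? Is there any relevant documentation? I have a workaround, but it is quite hacky and needlessly degrades runtime performance.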
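For context, one generic pattern for thread-bound iterators is to drain the iterator on the thread that owns it and hand only the buffered elements to other threads. The sketch below is an assumption-laden illustration, not necessarily the workaround mentioned above: `ThreadBoundIterator` and `processPartition` are hypothetical names, the iterator only simulates the restriction, and buffering assumes the whole partition fits in memory.

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

object DrainFirstSketch {
  // Hypothetical iterator that may only be consumed on the thread that created it.
  final class ThreadBoundIterator[A](underlying: Iterator[A]) extends Iterator[A] {
    private val owner = Thread.currentThread()
    private def check(): Unit =
      if (Thread.currentThread() ne owner)
        throw new IllegalStateException("consumed on a foreign thread")
    def hasNext: Boolean = { check(); underlying.hasNext }
    def next(): A = { check(); underlying.next() }
  }

  // Drain on the calling (owning) thread, then process the buffered elements in Futures.
  def processPartition[A, B](iterator: Iterator[A])(work: A => B): Vector[B] = {
    val buffered = iterator.toVector                 // runs on the owning thread
    val futures = buffered.map(a => Future(work(a))) // runs on pool threads
    Await.result(Future.sequence(futures), Duration.Inf)
  }

  def main(args: Array[String]): Unit = {
    val it = new ThreadBoundIterator(Iterator(1, 2, 3))
    println(processPartition(it)(_ * 2)) // Vector(2, 4, 6)
  }
}
```

The cost of this pattern is that the partition is materialized before any parallel work starts, so it gives up streaming and uses more memory.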