Running threads inside Spark DataFrame foreachPartition()

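I use multiple threads inside foreachPartition(), which works well for me except when the underlying iterator is a TungstenAggregationIterator. Here is a minimal snippet that reproduces the problem: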

    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, Future}

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    object Reproduce extends App {

      val sc = new SparkContext("local", "reproduce")
      val sqlContext = new SQLContext(sc)

      import sqlContext.implicits._

      val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()

      df.foreachPartition { iterator =>
        val f = Future(iterator.toVector)
        Await.result(f, Duration.Inf)
      }
    }

When I run this, I get:

    java.lang.NullPointerException
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
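
I believe I understand why this happens: TungstenAggregationIterator uses a ThreadLocal variable, which returns null when it is read from any thread other than the one that originally obtained the iterator from Spark. From inspecting the code, this does not appear to differ between recent Spark versions.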
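
To illustrate the mechanism described above without involving Spark at all, here is a minimal sketch of how a ThreadLocal behaves across threads. The object name ThreadLocalDemo and the value "task-context" are hypothetical stand-ins chosen for illustration; they are not taken from Spark's source.

```scala
import java.util.concurrent.Executors

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

object ThreadLocalDemo {

  // Hypothetical stand-in for a per-task context held in a ThreadLocal.
  private val context = new ThreadLocal[String]

  /** Returns (value seen on the calling thread, value seen on a pool thread). */
  def readOnOtherThread(): (String, String) = {
    // Set the value on the calling thread only.
    context.set("task-context")

    // A dedicated single-thread pool guarantees the Future runs elsewhere.
    val pool = Executors.newSingleThreadExecutor()
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      // The pool thread never called set(), so get() yields the default: null.
      val seenOnPoolThread = Await.result(Future(context.get()), Duration.Inf)
      (context.get(), seenOnPoolThread)
    } finally {
      pool.shutdown()
      context.remove()
    }
  }
}
```

Calling ThreadLocalDemo.readOnOtherThread() returns the set value for the calling thread and null for the pool thread. Dereferencing such a null inside the iterator's next() would be consistent with the NullPointerException in the stack trace above.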

However, as far as I can tell, this limitation is specific to TungstenAggregationIterator, and it is not documented.

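Is there a way to work around this limitation of TungstenAggregationIterator? Is there any documentation about it? I have a workaround, but it is quite hacky and needlessly degrades runtime performance.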
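
For concreteness, one possible shape for such a workaround is sketched below. It is an assumption about what could work, not the workaround mentioned above and not a documented Spark pattern: every next() call stays on the thread that received the iterator, and only already-materialized batches are handed to worker threads. The names CallerThreadDrain, processInBatches, batchSize and parallelism are hypothetical.

```scala
import java.util.concurrent.Executors

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

object CallerThreadDrain {

  /** Pulls batches from `iterator` on the calling thread and runs `work`
    * on each batch in a separate pool thread, preserving batch order.
    */
  def processInBatches[A, B](iterator: Iterator[A], batchSize: Int, parallelism: Int)(
      work: Seq[A] => B): Vector[B] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      // grouped() and toVector advance the iterator on the calling thread;
      // only the body of each Future (the work itself) runs on the pool.
      val futures = iterator
        .grouped(batchSize)
        .map(batch => Future(work(batch)))
        .toVector
      Await.result(Future.sequence(futures), Duration.Inf)
    } finally {
      pool.shutdown()
    }
  }
}
```

Inside foreachPartition, this would be called as CallerThreadDrain.processInBatches(iterator, 1000, 4)(batch => ...) in place of Future(iterator.toVector). The trade-off is that iteration itself is no longer parallel and every batch is queued before any result is awaited, so this only helps when the per-batch work dominates the cost of reading the partition.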