
Scala Spark: partition.toList fails


I want to group several elements within a partition and then perform some operations on the grouped elements of each partition. But I found that converting a partition to a list fails. See the following example:

import scala.collection.mutable.ArrayBuffer

val rdd = sc.parallelize(Seq("a", "b", "c", "d", "e"), 2)
val mapped = rdd.mapPartitions { partition =>
  val total = partition.size
  val first = partition.toList match {
    case Nil => "EMPTYLIST"
    case _   => partition.toList.head
  }

  val finalResult = ArrayBuffer[String]()
  finalResult += "1:" + first
  finalResult += "2:" + first
  finalResult += "3:" + first

  finalResult.iterator
}

mapped.collect()
Result:

Array[String] = Array(1:EMPTYLIST, 2:EMPTYLIST, 3:EMPTYLIST, 1:EMPTYLIST, 2:EMPTYLIST, 3:EMPTYLIST)

Why is partition.toList always empty?

The partition is an Iterator, and counting its size consumes it, so by the time you call toList it is already empty. To traverse a partition more than once, convert it to a List at the start and then do all subsequent work on that list:

val mapped = rdd.mapPartitions { partition =>
  val partitionList = partition.toList  // materialize once, up front
  val total = partitionList.size
  val first = partitionList match {
    case Nil => "EMPTYLIST"
    case _   => partitionList.head
  }

  val finalResult = ArrayBuffer[String]()
  finalResult += "1:" + first
  finalResult += "2:" + first
  finalResult += "3:" + first

  finalResult.iterator
}

mapped.collect
// res7: Array[String] = Array(1:a, 2:a, 3:a, 1:c, 2:c, 3:c)
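The single-pass behavior described above is a property of Scala's Iterator itself, not of Spark. A minimal sketch in plain Scala (no Spark needed) showing why the original code saw an empty list:

```scala
// An Iterator is single-pass: calling size walks through and exhausts it.
val it = Iterator("a", "b", "c")
val total = it.size        // consumes all elements; total == 3
val remaining = it.toList  // iterator is already empty here
println(remaining)         // List()

// Materializing first lets you traverse repeatedly:
val it2 = Iterator("a", "b", "c")
val list = it2.toList      // single pass, stored in memory
println(list.size)         // 3
println(list.headOption)   // Some(a)
```

Note that toList materializes the whole partition in memory on the worker, which is fine for small partitions but can be costly for large ones.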

Does partition.toList merge data from the workers onto the driver? I don't think so. It should stay on each worker, converted there from an iterator to a list. See also some related discussion.

Interesting, then: if I run partition.take(10), will it take 10 elements from each worker?

It will. You can test it with:

rdd.mapPartitions(p => p.take(2)).collect

which returns, for the sample data:

Array(a, b, c, d)

and

rdd.mapPartitions(p => p.take(1)).collect

returns:

Array(a, c)
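The per-partition behavior of take in the comments above can be sketched in plain Scala by modeling the two partitions as iterators (assuming the same split as the example, Seq("a","b") and Seq("c","d","e")):

```scala
// Each partition is an independent Iterator; take(n) yields up to n
// elements from each one, which mirrors what
// rdd.mapPartitions(p => p.take(n)) does on the workers.
val partitions = Seq(Iterator("a", "b"), Iterator("c", "d", "e"))

val taken2 = partitions.flatMap(p => p.take(2))
println(taken2)  // List(a, b, c, d)

val partitionsAgain = Seq(Iterator("a", "b"), Iterator("c", "d", "e"))
val taken1 = partitionsAgain.flatMap(p => p.take(1))
println(taken1)  // List(a, c)
```

This matches the collect results shown above: take runs independently per partition, so the total element count is n times the number of non-empty partitions, capped by each partition's size.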