Apache Spark: an RDD operation like top that returns a smaller RDD


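I am looking for a Spark RDD operation like top or takeOrdered, but one that returns another RDD rather than an Array, that is, one that does not collect the full result into RAM.

It can be a sequence of operations, but ideally no step should try to collect the full result into the memory of a single node.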

Take a look at (link not preserved), or use aggregate together with a BoundedPriorityQueue.
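The aggregate suggestion could look roughly like the sketch below. It is an illustration under assumptions, not the answerer's code: Spark's own BoundedPriorityQueue is, as far as I know, not part of the public API, so the sketch bounds a standard scala.collection.mutable.PriorityQueue by hand. The names TopAsRdd, offer and topAsRdd are hypothetical. Only the k retained elements travel to the driver, and they are then turned back into an RDD with parallelize, so this fits only when k itself is small enough for one node.

```scala
import org.apache.spark.rdd.RDD
import scala.collection.mutable
import scala.reflect.ClassTag

object TopAsRdd {
  // Hypothetical helper: add x to a heap that retains only the k largest elements.
  // The heap must be built with the reversed ordering, so that `head` is the
  // smallest element currently retained.
  def offer[T](heap: mutable.PriorityQueue[T], x: T, k: Int)
              (implicit ord: Ordering[T]): mutable.PriorityQueue[T] = {
    if (heap.size < k) heap.enqueue(x)
    else if (k > 0 && ord.gt(x, heap.head)) {
      heap.dequeue()   // drop the smallest retained element
      heap.enqueue(x)  // keep the larger newcomer
    }
    heap
  }

  // Hypothetical helper: the k largest elements of rdd, returned as a new RDD.
  def topAsRdd[T: ClassTag](rdd: RDD[T], k: Int)(implicit ord: Ordering[T]): RDD[T] = {
    val zero = mutable.PriorityQueue.empty[T](ord.reverse)
    val best = rdd.aggregate(zero)(
      (heap, x) => offer(heap, x, k),                          // fold within a partition
      (a, b) => b.foldLeft(a)((heap, x) => offer(heap, x, k))  // merge two partial heaps
    )
    // At most k elements reach the driver; redistribute them as an RDD, largest first.
    rdd.sparkContext.parallelize(best.toSeq.sorted(ord.reverse))
  }
}
```

For example, TopAsRdd.topAsRdd(rdd, 100) would return an RDD holding the 100 largest elements. When k is a large fraction of the data, a sort-based approach is the better fit.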

Suppose you want the top 50% of the RDD:

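import org.apache.spark.rdd.RDD

def top50(rdd: RDD[(Double, String)]): RDD[(Double, String)] = {
  // sortByKey range-partitions the data, so lower partition ids hold larger keys.
  val sorted = rdd.sortByKey(ascending = false)
  val partitions = sorted.partitions.length
  // Throw away the contents of the lower partitions.
  // The cut is made at partition granularity, so the result is only roughly 50%.
  sorted.mapPartitionsWithIndex { (pid, it) =>
    // Return Iterator.empty rather than Nil: the function must yield an Iterator.
    if (pid < partitions / 2) it else Iterator.empty
  }
}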

After running sortByKey, you can use zipWithIndex and filter to get a finer-grained result, though I have not tested it with multiple partitions.

Good point! But zipWithIndex also incurs an extra computation. It is needed if you want to cut off by absolute size rather than by ratio, but not for the basic sort-and-filter-by-index trick.
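The zipWithIndex idea from the comments might be sketched as follows, for cutting off at an absolute size n rather than a ratio. This is an untested-at-scale illustration, and the names TopN and topN are hypothetical. zipWithIndex assigns indices by partition order and then by position within each partition, so after sortByKey the index should follow the global sort order. It also has to run an additional Spark job when the RDD has more than one partition, which is the extra computation mentioned above.

```scala
import org.apache.spark.rdd.RDD

object TopN {
  // Hypothetical helper: the n entries with the largest keys, as an RDD.
  def topN(rdd: RDD[(Double, String)], n: Long): RDD[(Double, String)] =
    rdd.sortByKey(ascending = false)          // globally sorted, largest key first
      .zipWithIndex()                         // pair each entry with its global position
      .filter { case (_, idx) => idx < n }    // keep the first n positions
      .map { case (pair, _) => pair }         // drop the index again
}
```

Nothing is collected to the driver here; the result stays distributed, at the cost of the full sort plus the extra pass that zipWithIndex performs.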