Apache Spark: grouping an RDD by key in Java


I'm trying to group an RDD using groupBy. Most of the documentation advises against groupBy because of how it groups the keys internally. Is there another way to achieve this? I can't use reduceByKey, because I'm not performing a reduce operation here.

// Entry: long id, String name
JavaRDD<Entry> entries = rdd.groupBy(Entry::getId)
        .flatMapValues(x -> someOp(x))
        .values()
        .filter(...);

aggregateByKey works like the aggregate function, except that the aggregation is applied to the values with the same key. Also, unlike the aggregate function, the initial value is not applied to the second reduce.
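To see that difference concretely, here is a small Java sketch (an illustration only, assuming an existing JavaSparkContext named sc): with plain aggregate, the zero value also seeds the final combine across partitions.

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4), 2);
// zeroValue 100 seeds the max in each of the 2 partitions AND the final sum:
// 100 (initial) + 100 (partition 0 max) + 100 (partition 1 max) = 300
Integer result = nums.aggregate(100, Math::max, Integer::sum);

With aggregateByKey, by contrast, the zero value is used only inside each partition; see the zeroValue-100 example below, where cat ends at 200, not 300.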

Listing variants:

def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]

def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]

def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]

Example:

val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)

// let's have a look at what is in the partitions
def myfunc(index: Int, iter: Iterator[(String, Int)]) : Iterator[String] = {
  iter.map(x => "[partID:" +  index + ", val: " + x + "]")
}
pairRDD.mapPartitionsWithIndex(myfunc).collect

res2: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)], [partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])

pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res3: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))

With zeroValue 0, seqOp keeps the max for each key inside each partition (cat: 5 in partition 0, 12 in partition 1), and combOp then sums those per-partition maxima (cat: 5 + 12 = 17).

pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res4: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))

With zeroValue 100, every per-partition max is at least 100, so each key contributes 100 for every partition it appears in: dog appears in one partition (100), cat and mouse in two (100 + 100 = 200). Note the zero value is applied once per partition, not again in the final combine.
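Since the question targets the Java API, here is a rough Java equivalent of the example above (a sketch, again assuming an existing JavaSparkContext named sc):

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("cat", 2), new Tuple2<>("cat", 5), new Tuple2<>("mouse", 4),
        new Tuple2<>("cat", 12), new Tuple2<>("dog", 12), new Tuple2<>("mouse", 2)), 2);

// seqOp: max per key within a partition; combOp: sum of the per-partition maxima
pairs.aggregateByKey(0, Math::max, Integer::sum).collect();
// [(dog,12), (cat,17), (mouse,6)]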

Yes, your solution will work if it is a pair RDD. I would have to perform a keyBy() operation first to convert it into a pair RDD; to continue with the above approach, keyBy() must be executed first, as sketched below.
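For instance, a minimal sketch of that flow in Java, reusing the hypothetical Entry class and rdd from the question (it groups all entries per id into a list, with no reduction of individual values):

import java.util.ArrayList;
import org.apache.spark.api.java.JavaPairRDD;

// keyBy turns the JavaRDD<Entry> into a pair RDD keyed by id
JavaPairRDD<Long, Entry> byId = rdd.keyBy(Entry::getId);

// aggregateByKey then collects the entries that share an id:
// seqOp appends within a partition, combOp merges the partial lists
JavaPairRDD<Long, ArrayList<Entry>> grouped = byId.aggregateByKey(
        new ArrayList<Entry>(),
        (list, e) -> { list.add(e); return list; },
        (left, right) -> { left.addAll(right); return left; });

Note that if the downstream someOp really needs every value of a key at once, this still materializes each group, so the memory profile is similar to groupByKey; the savings appear when seqOp/combOp can shrink the data instead.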