
Scala Spark: group multiple RDD items by key


I have RDD items like:

(3922774869,10,1)
(3922774869,11,1)
(3922774869,12,2)
(3922774869,13,2)
(1779744180,10,1)
(1779744180,11,1)
(3922774869,14,3)
(3922774869,15,2)
(1779744180,16,1)
(3922774869,12,1)
(3922774869,13,1)
(1779744180,14,1)
(1779744180,15,1)
(1779744180,16,1)
(3922774869,14,2)
(3922774869,15,1)
(1779744180,16,1)
(1779744180,17,1)
(3922774869,16,4)
...
representing (id, age, count). I want to group these rows into a dataset where each row describes the age distribution of one id, as shown below ((id, age) is unique):

That is:
(id, (age, count), (age, count), ...)


Could anyone give me a clue?

You can reduce by the two fields first, and then use groupByKey:

rdd
  .map { case (id, age, count) => ((id, age), count) }.reduceByKey(_ + _)
  .map { case ((id, age), count) => (id, (age, count)) }.groupByKey()
This returns an RDD[(Long, Iterable[(Int, Int)])] which, for the input above, will contain these two records:

(1779744180,CompactBuffer((16,3), (15,1), (14,1), (11,1), (10,1), (17,1)))
(3922774869,CompactBuffer((11,1), (12,3), (16,4), (13,3), (15,3), (10,1), (14,5)))
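If you want to try this end to end, here is a minimal sketch, assuming a SparkContext named sc (e.g. from spark-shell); the sample data is abbreviated and the variable names are illustrative:

// Build a small sample RDD of (id, age, count) rows; note the L suffixes,
// since ids such as 3922774869 do not fit in an Int.
val rdd = sc.parallelize(Seq(
  (3922774869L, 10, 1), (3922774869L, 11, 1), (3922774869L, 12, 2),
  (1779744180L, 10, 1), (1779744180L, 11, 1), (3922774869L, 12, 1)
))

val grouped = rdd
  .map { case (id, age, count) => ((id, age), count) }
  .reduceByKey(_ + _)                                    // sum the counts per (id, age)
  .map { case ((id, age), count) => (id, (age, count)) }
  .groupByKey()                                          // one record per id

grouped.collect().foreach(println)
// e.g. (1779744180,CompactBuffer((10,1), (11,1)))
//      (3922774869,CompactBuffer((10,1), (11,1), (12,3)))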

As Tzach Zohar suggests, you can first reshape the RDD into a key/value RDD. If you have a very large dataset, I would suggest avoiding groupByKey in order to reduce shuffling, even though it looks very simple. For example, building on the posted solution:

import scala.collection.mutable

// Sum the counts per (id, age) as before, then re-key by id alone
val rddById = rdd
  .map { case (id, age, count) => ((id, age), count) }
  .reduceByKey(_ + _)
  .map { case ((id, age), count) => (id, (age, count)) }

// aggregateByKey builds one HashSet of (age, count) pairs per id,
// combining locally within each partition before shuffling
val initialSet = mutable.HashSet.empty[(Int, Int)]
val addToSet = (s: mutable.HashSet[(Int, Int)], v: (Int, Int)) => s += v
val mergePartitionSets = (p1: mutable.HashSet[(Int, Int)], p2: mutable.HashSet[(Int, Int)]) => p1 ++= p2
val uniqueByKey = rddById.aggregateByKey(initialSet)(addToSet, mergePartitionSets)
This gives you:

uniqueByKey: org.apache.spark.rdd.RDD[(AnyVal, scala.collection.mutable.HashSet[(Int, Int)])]
You can print the values with:

scala> uniqueByKey.foreach(println)
(1779744180,Set((15,1), (16,3)))
(1779744180,Set((14,1), (11,1), (10,1), (17,1)))
(3922774869,Set((12,3), (11,1), (10,1), (14,5), (16,4), (15,3), (13,3)))
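If the per-id distributions are small enough to bring back to the driver, a short follow-up such as the sketch below (not part of the original answer; the sorting step is only for readability) prints each id's pairs ordered by age:

// Sort each id's (age, count) pairs by age and collect them to the driver
val distributions = uniqueByKey
  .mapValues(_.toSeq.sortBy(_._1))
  .collect()

distributions.foreach { case (id, pairs) =>
  println(s"$id -> ${pairs.mkString(", ")}")
}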
Shuffling can be a big bottleneck, and having many large HashSets (depending on your dataset) can also be a problem. However, you are more likely to have a large amount of RAM (64 GB of RAM, say?) than to escape network latency (and all the problems shuffling brings), so keeping the aggregation local results in faster reads/writes across the distributed machines.


To read more about it, check this out.

Technically speaking, the issue is not the size of the dataset but the size of the largest key group, relative to the worker nodes.
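One quick way to gauge that (a hypothetical sketch reusing the question's rdd; the names are illustrative) is to count how many rows each id contributes and look at the biggest groups:

// Count rows per id and show the five most frequent ids (the largest key groups)
val groupSizes = rdd
  .map { case (id, _, _) => (id, 1L) }
  .reduceByKey(_ + _)

groupSizes
  .sortBy(_._2, ascending = false)
  .take(5)
  .foreach(println)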