
Spark/Scala: Create a nested structure with reduceByKey using only RDDs

I want to create a nested structure using only RDDs. I can do it with the groupBy function, but groupBy does not perform well on large data sets, so I would like to do it with reduceByKey instead. However, I cannot get the result I want. Any help would be appreciated.

Input data:

val sales = sc.parallelize(List(
  ("West",  "Apple",  2.0, 10),
  ("West",  "Apple",  3.0, 15),
  ("West",  "Orange", 5.0, 15),
  ("South", "Orange", 3.0, 9),
  ("South", "Orange", 6.0, 18),
  ("East",  "Milk",   5.0, 5)))
The desired output is a list of structs per key. I can get it with groupBy, as shown below:

sales.map(value => (value._1 ,(value._2,value._3,value._4  )) )
  .groupBy(_._1)
  .map { case(k,v) => (k, v.map(_._2)) }
  .collect()
  .foreach(println)

// (South,List((Orange,3.0,9), (Orange,6.0,18)))
// (East,List((Milk,5.0,5)))
// (West,List((Apple,2.0,10), (Apple,3.0,15), (Orange,5.0,15)))
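
For reference, a minimal sketch (not part of the original post) of what a literal List[Struct] could look like, using a hypothetical case class Sale with the same grouping approach:

case class Sale(product: String, price: Double, qty: Int)

sales.map { case (region, product, price, qty) => (region, Sale(product, price, qty)) }
  .groupByKey()
  .mapValues(_.toList)   // Iterable[Sale] -> List[Sale]
  .collect()
  .foreach(println)

// e.g. (South,List(Sale(Orange,3.0,9), Sale(Orange,6.0,18)))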
But I want to achieve the same result with reduceByKey. I cannot get a List[Struct]; instead I end up with a List[List]. Is there a way to get a List[Struct]?

sales.map(value => (value._1 ,List(value._2,value._3,value._4)))
  .reduceByKey((a,b) => (a ++ b))
  .collect()
  .foreach(println)

// (South,List(Orange, 3.0, 9, Orange, 6.0, 18))
// (East,List(Milk, 5.0, 5))
// (West,List(Apple, 2.0, 10, Apple, 3.0, 15, Orange, 5.0, 15))

sales.map(value => (value._1 ,List(value._2,value._3,value._4)))
  .reduceByKey((a,b) =>(List(a) ++ List(b)))
  .collect()
  .foreach(println)

// (South,List(List(Orange, 3.0, 9), List(Orange, 6.0, 18)))
// (East,List(Milk, 5.0, 5))
// (West,List(List(List(Apple, 2.0, 10), List(Apple, 3.0, 15)), List(Orange, 5.0, 15)))
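
One way to keep each tuple intact with reduceByKey is to wrap every record in a single-element List first, so the reduce function only ever concatenates values of the same type, List[(String, Double, Int)]. A minimal sketch (it matches the desired output, but, as the answer below points out, it does not reduce any data before the shuffle, so it will not be faster than groupByKey):

sales.map(value => (value._1, List((value._2, value._3, value._4))))
  .reduceByKey(_ ++ _)   // concatenates lists; element type stays (String, Double, Int)
  .collect()
  .foreach(println)

// (South,List((Orange,3.0,9), (Orange,6.0,18)))
// (East,List((Milk,5.0,5)))
// (West,List((Apple,2.0,10), (Apple,3.0,15), (Orange,5.0,15)))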
  • You can't: reduceByKey takes a function (V, V) ⇒ V, so it cannot change the value type.
  • You can use aggregateByKey or combineByKey, but it will not improve performance, because your process does not actually reduce the amount of data (a sketch follows the code below).
  • You can gain a little (it avoids the temporary objects) with:

sales.map(value => (value._1, (value._2, value._3, value._4))).groupByKey
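
A minimal sketch of the aggregateByKey variant mentioned above, assuming the goal is a List[(String, Double, Int)] per key. It yields the same nested output, but since nothing is actually combined, it will not outperform groupByKey:

sales.map(value => (value._1, (value._2, value._3, value._4)))
  .aggregateByKey(List.empty[(String, Double, Int)])(
    (acc, v) => v :: acc,            // seqOp: prepend a record to the per-partition list
    (left, right) => left ::: right  // combOp: merge partial lists across partitions
  )
  .collect()
  .foreach(println)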