
Spark Scala: Aggregating DataFrame column values into an ordered list


I have a Spark Scala DataFrame with four columns: (id, day, val, order). I want to create a new DataFrame with the columns (id, day, value_list: List(val1, val2, ..., valn)), where val1 through valn are sorted in ascending order by the order value.

For example:

(50, 113, 1, 1), 
(50, 113, 1, 3), 
(50, 113, 2, 2), 
(51, 114, 1, 2), 
(51, 114, 2, 1), 
(51, 113, 1, 1)
would become:

((51,113),List(1))
((51,114),List(2, 1))
((50,113),List(1, 2, 1))
I'm close, but after aggregating the data into a list I don't know what to do next. I can't figure out how to get Spark to sort each value list by the order int:

import org.apache.spark.sql.Row

val testList = List((50, 113, 1, 1), (50, 113, 1, 3), (50, 113, 2, 2), (51, 114, 1, 2), (51, 114, 2, 1), (51, 113, 1, 1))
val testDF = sqlContext.sparkContext.parallelize(testList).toDF("id1", "id2", "val", "order")

// key by (id1, id2) and wrap each (val, order) pair in a single-element list
val rDD1 = testDF.map{case Row(key1: Int, key2: Int, val1: Int, val2: Int) => ((key1, key2), List((val1, val2)))}
// concatenate the per-key lists
val rDD2 = rDD1.reduceByKey{case (x, y) => x ++ y}
The output looks like this:

((51,113),List((1,1)))
((51,114),List((1,2), (2,1)))
((50,113),List((1,3), (1,1), (2,2)))
The next step would be to produce:

((51,113),List((1,1)))
((51,114),List((2,1), (1,2)))
((50,113),List((1,1), (2,2), (1,3)))

You just need to map your RDD and then use sortBy:

scala> val df = Seq((50, 113, 1, 1), (50, 113, 1, 3), (50, 113, 2, 2), (51, 114, 1, 2), (51, 114, 2, 1), (51, 113, 1, 1)).toDF("id1", "id2", "val", "order")
df: org.apache.spark.sql.DataFrame = [id1: int, id2: int, val: int, order: int]

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val rDD1 = df.map{case Row(key1: Int, key2: Int, val1: Int, val2: Int)  => ((key1, key2), List((val1, val2)))}
rDD1: org.apache.spark.rdd.RDD[((Int, Int), List[(Int, Int)])] = MapPartitionsRDD[10] at map at <console>:28

scala> val rDD2 = rDD1.reduceByKey{case (x, y) =>  x ++ y}
rDD2: org.apache.spark.rdd.RDD[((Int, Int), List[(Int, Int)])] = ShuffledRDD[11] at reduceByKey at <console>:30

scala> val rDD3 = rDD2.map(x => (x._1, x._2.sortBy(_._2)))
rDD3: org.apache.spark.rdd.RDD[((Int, Int), List[(Int, Int)])] = MapPartitionsRDD[12] at map at <console>:32

scala> rDD3.collect.foreach(println)
((51,113),List((1,1)))
((50,113),List((1,1), (2,2), (1,3)))
((51,114),List((2,1), (1,2)))
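
To reach the (id, day, value_list) shape asked for in the question, one extra map to strip the order component should do it. A minimal sketch continuing from rDD3 above (the rDD4 name is just illustrative; the row order of the printed output may vary):

// keep only the values; the lists are already sorted by order
val rDD4 = rDD3.map { case (key, pairs) => (key, pairs.map(_._1)) }
rDD4.collect.foreach(println)
// ((51,113),List(1))
// ((50,113),List(1, 2, 1))
// ((51,114),List(2, 1))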
Alternatively, with the DataFrame API:

testDF.groupBy("id1","id2").agg(collect_list($"val")).show
+---+---+-----------------+                                                     
|id1|id2|collect_list(val)|
+---+---+-----------------+
| 51|113|              [1]|
| 51|114|           [1, 2]|
| 50|113|        [1, 1, 2]|
+---+---+-----------------+
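
Note that collect_list by itself gives no ordering guarantee, which is why the lists above are not sorted by the order column. One way to get the desired ordering while staying in the DataFrame API is to collect (order, val) structs and sort them, since sort_array orders structs by their first field. A rough sketch, assuming a Spark version where collect_list accepts struct columns and where a field can be extracted from an array of structs (the sorted and value_list column names are just illustrative):

import org.apache.spark.sql.functions._

testDF
  .groupBy("id1", "id2")
  .agg(sort_array(collect_list(struct($"order", $"val"))).as("sorted"))  // sort (order, val) pairs by order
  .select($"id1", $"id2", $"sorted.val".as("value_list"))                // keep only the vals, in sorted order
  .show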