Spark Scala: aggregate DataFrame column values into an ordered list
I have a Spark Scala DataFrame with four columns: (id, day, val, order). I want to create a new DataFrame with the columns (id, day, value_list: List(val1, val2, ..., valn)), where val1 through valn are sorted by ascending order value. For example:
(50, 113, 1, 1),
(50, 113, 1, 3),
(50, 113, 2, 2),
(51, 114, 1, 2),
(51, 114, 2, 1),
(51, 113, 1, 1)
would become:
((51,113),List(1))
((51,114),List(2, 1))
((50,113),List(1, 2, 1))
I'm close, but after aggregating the data into a list I don't know what to do next. I have no idea how to get Spark to sort each value list by the order int:
import org.apache.spark.sql.Row
import sqlContext.implicits._  // needed for toDF when not running in the spark-shell

val testList = List((50, 113, 1, 1), (50, 113, 1, 3), (50, 113, 2, 2), (51, 114, 1, 2), (51, 114, 2, 1), (51, 113, 1, 1))
val testDF = sqlContext.sparkContext.parallelize(testList).toDF("id1", "id2", "val", "order")
// Key each row by (id1, id2) and collect the (val, order) pairs per key
val rDD1 = testDF.map{case Row(key1: Int, key2: Int, val1: Int, val2: Int) => ((key1, key2), List((val1, val2)))}
val rDD2 = rDD1.reduceByKey{case (x, y) => x ++ y}
which gives output that looks like this:
((51,113),List((1,1)))
((51,114),List((1,2), (2,1)))
((50,113),List((1,3), (1,1), (2,2)))
The next step would be to produce:
((51,113),List((1,1)))
((51,114),List((2,1), (1,2)))
((50,113),List((1,1), (2,2), (1,3)))
You just have to map over your RDD and use sortBy:
scala> val df = Seq((50, 113, 1, 1), (50, 113, 1, 3), (50, 113, 2, 2), (51, 114, 1, 2), (51, 114, 2, 1), (51, 113, 1, 1)).toDF("id1", "id2", "val", "order")
df: org.apache.spark.sql.DataFrame = [id1: int, id2: int, val: int, order: int]
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val rDD1 = df.map{case Row(key1: Int, key2: Int, val1: Int, val2: Int) => ((key1, key2), List((val1, val2)))}
rDD1: org.apache.spark.rdd.RDD[((Int, Int), List[(Int, Int)])] = MapPartitionsRDD[10] at map at <console>:28
scala> val rDD2 = rDD1.reduceByKey{case (x, y) => x ++ y}
rDD2: org.apache.spark.rdd.RDD[((Int, Int), List[(Int, Int)])] = ShuffledRDD[11] at reduceByKey at <console>:30
scala> val rDD3 = rDD2.map(x => (x._1, x._2.sortBy(_._2)))
rDD3: org.apache.spark.rdd.RDD[((Int, Int), List[(Int, Int)])] = MapPartitionsRDD[12] at map at <console>:32
scala> rDD3.collect.foreach(println)
((51,113),List((1,1)))
((50,113),List((1,1), (2,2), (1,3)))
((51,114),List((2,1), (1,2)))
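To get the shape the question asks for, List(val1, ..., valn) rather than a list of (val, order) pairs, one more map dropping the order component does it; a minimal sketch (the key ordering in the collected output may vary):

// Drop the order component, keeping only the values in their sorted positions
val rDD4 = rDD3.map { case (key, list) => (key, list.map(_._1)) }
rDD4.collect.foreach(println)
// ((51,113),List(1))
// ((50,113),List(1, 2, 1))
// ((51,114),List(2, 1))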
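For comparison, the DataFrame API can do the grouping in one line with groupBy plus collect_list, though collect_list alone gives no guarantee about the order of the collected values: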
import org.apache.spark.sql.functions.collect_list  // if not already in scope

testDF.groupBy("id1","id2").agg(collect_list($"val")).show
+---+---+-----------------+
|id1|id2|collect_list(val)|
+---+---+-----------------+
| 51|113| [1]|
| 51|114| [1, 2]|
| 50|113| [1, 1, 2]|
+---+---+-----------------+
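To get the ordering guarantee without dropping to RDDs, one option is a sketch along these lines, assuming a Spark version where sort_array is available and struct arrays sort lexicographically by field: collect (order, val) structs, sort the array by order, then extract just the val field.

import org.apache.spark.sql.functions.{collect_list, sort_array, struct}

// sort_array orders the structs by their first field, i.e. `order`;
// selecting pairs.val on an array of structs yields the array of val fields.
val ordered = testDF
  .groupBy($"id1", $"id2")
  .agg(sort_array(collect_list(struct($"order", $"val"))).as("pairs"))
  .select($"id1", $"id2", $"pairs.val".as("value_list"))
// value_list would then be [1] for (51,113), [2, 1] for (51,114), [1, 2, 1] for (50,113)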