Scala: convert a dataframe using the DataFrame API instead of RDD transformations

I have a dataframe with the following data:

loans,MTG,111
loans,MTG,102
loans,CRDS,103
loans,PCL,104
loans,PCL,105

I want the result to look like this:

loans,MTG:111:102,PCL:104:105,CRDS:103

I was able to achieve it using RDD transformations:
var data = Seq(("loans","MTG",111),("loans","MTG",102),("loans","CRDS",103),("loans","PCL",104),("loans","PCL",105))
var fd1 = sc.parallelize(data)
// key by (col1, col2); tuples are accessed with ._1/._2/._3, not x(0),
// and the value must be a String so the ":" join type-checks in reduceByKey
var fd2 = fd1.map(x => ((x._1, x._2), x._3.toString))
var fd3 = fd2.reduceByKey((a, b) => a + ":" + b)
var fd4 = fd3.map(x => (x._1._1, x._1._2 + ":" + x._2))
var fd5 = fd4.groupByKey()
var fd6 = fd5.mapValues(v => v.mkString(","))
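For reference, the RDD pipeline above can be run end to end as a standalone script; this is a minimal sketch assuming a local SparkSession (in spark-shell, `sc` already exists):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("rdd-demo").getOrCreate()
val sc = spark.sparkContext

val data = Seq(("loans","MTG",111), ("loans","MTG",102), ("loans","CRDS",103),
               ("loans","PCL",104), ("loans","PCL",105))

// key by (col1, col2); values as Strings so the ":" join type-checks
val pairs  = sc.parallelize(data).map(x => ((x._1, x._2), x._3.toString))
val joined = pairs.reduceByKey(_ + ":" + _)                        // (loans,MTG) -> "111:102"
val byCol1 = joined.map { case ((c1, c2), v) => (c1, s"$c2:$v") }  // loans -> "MTG:111:102"
val result = byCol1.groupByKey().mapValues(_.mkString(","))

result.collect().foreach(println)
```

Note that the order of the ":"-joined values depends on partitioning, so MTG may come out as 111:102 or 102:111.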
I'd like to achieve the same result using the DataFrame/Dataset API or Spark SQL. Please help.

Use .groupBy with collect_list/collect_set and the concat_ws built-in function from the DataFrame API.

Example:
//sample dataframe
import org.apache.spark.sql.functions._
var data = Seq(("loans","MTG",111),("loans","MTG",102),("loans","CRDS",103),("loans","PCL",104),("loans","PCL",105)).toDF("col1","col2","col3")
data.show()
//+-----+----+----+
//| col1|col2|col3|
//+-----+----+----+
//|loans| MTG| 111|
//|loans| MTG| 102|
//|loans|CRDS| 103|
//|loans| PCL| 104|
//|loans| PCL| 105|
//+-----+----+----+
data.groupBy("col1","col2").
agg(concat_ws(":",collect_set("col3")).alias("col3")).
selectExpr("col1","""concat_ws(":",col2,col3) as col2""").
groupBy("col1").
agg(concat_ws(",",collect_list("col2")).alias("col2")).
show(false)
//+-----+--------------------------------+
//|col1 |col2 |
//+-----+--------------------------------+
//|loans|MTG:102:111,CRDS:103,PCL:104:105|
//+-----+--------------------------------+
//collect
data.groupBy("col1","col2").agg(concat_ws(":",collect_set("col3")).alias("col3")).selectExpr("col1","""concat_ws(":",col2,col3) as col2""").groupBy("col1").agg(concat_ws(",",collect_list("col2")).alias("col2")).collect()
//res22: Array[org.apache.spark.sql.Row] = Array([loans,MTG:102:111,CRDS:103,PCL:104:105])
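Since the question also asks about Spark SQL, the same two-level aggregation can be written against a temp view. A minimal sketch, assuming a SparkSession named `spark` and the `data` dataframe from above:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("sql-demo").getOrCreate()
import spark.implicits._

val data = Seq(("loans","MTG",111), ("loans","MTG",102), ("loans","CRDS",103),
               ("loans","PCL",104), ("loans","PCL",105)).toDF("col1","col2","col3")
data.createOrReplaceTempView("tbl")

// inner query: collapse col3 per (col1, col2) into "MTG:111:102"
// outer query: collapse those strings per col1 into one comma-separated column
val result = spark.sql("""
  SELECT col1, concat_ws(',', collect_list(col2)) AS col2
  FROM (
    SELECT col1, concat_ws(':', col2, concat_ws(':', collect_set(col3))) AS col2
    FROM tbl
    GROUP BY col1, col2
  )
  GROUP BY col1
""")

result.show(false)
```

As with the DataFrame version, collect_set does not guarantee element order, so the values inside each group may come back in any order.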
Thanks, this was helpful.