Scala: need to transform a dataframe using the DataFrame API instead of RDDs

I have a dataframe with the following data:

loans,MTG,111
loans,MTG,102
loans,CRDS,103
loans,PCL,104
loans,PCL,105

I want to get a result like this:

loans,MTG:111:102,PCL:104:105,CRDS:103

I was able to achieve it using RDD transformations:

var data  = Seq(("loans","MTG",111),("loans","MTG" ,102),("loans","CRDS",103),("loans","PCL",104),("loans","PCL",105))


var fd1 = sc.parallelize(data)

var fd2 = fd1.map(x => ( (x(0),x(1)) , x(2) ) )

var fd3 = fd2.reduceByKey( (a,b) => a.toString + ":" + b.toString  )

var fd4 = fd3.map( x=> (x._1._1,(x._1._2 + ":"+ x._2)))

var fd5 = fd4.groupByKey()
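
For reference, collecting fd5 on the driver should yield something like the following (the order of groups and of the concatenated values is not guaranteed):

fd5.collect()
// e.g. Array((loans,CompactBuffer(MTG:111:102, CRDS:103, PCL:104:105)))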

I want to achieve the same result using the DataFrame/Dataset API or Spark SQL. Please help.

Use .groupBy with collect_list / collect_set and the built-in concat_ws function from the DataFrame API.

Example:

//sample dataframe
val data = Seq(("loans","MTG",111),("loans","MTG",102),("loans","CRDS",103),("loans","PCL",104),("loans","PCL",105)).toDF("col1","col2","col3")

import org.apache.spark.sql.functions._

data.show()
//+-----+----+----+
//| col1|col2|col3|
//+-----+----+----+
//|loans| MTG| 111|
//|loans| MTG| 102|
//|loans|CRDS| 103|
//|loans| PCL| 104|
//|loans| PCL| 105|
//+-----+----+----+

data.groupBy("col1","col2").
agg(concat_ws(":",collect_set("col3")).alias("col3")).
selectExpr("col1","""concat_ws(":",col2,col3) as col2""").
groupBy("col1").
agg(concat_ws(",",collect_list("col2")).alias("col2")).
show(false)

//+-----+--------------------------------+
//|col1 |col2                            |
//+-----+--------------------------------+
//|loans|MTG:102:111,CRDS:103,PCL:104:105|
//+-----+--------------------------------+
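
Note that collect_set does not guarantee element order, which is why the output shows MTG:102:111 rather than MTG:111:102; wrapping it as sort_array(collect_set(col3)) would at least make the order deterministic if that matters.

Since the question also asks about Spark SQL, the same aggregation can be sketched as a SQL query over a temp view (a sketch assuming the same data DataFrame and the spark session from the examples above):

// register the sample DataFrame as a temp view so it can be queried with SQL
data.createOrReplaceTempView("tbl")

// inner query concatenates col3 values per (col1, col2); outer query
// concatenates the resulting col2 strings per col1
spark.sql("""
  SELECT col1, concat_ws(",", collect_list(col2)) AS col2
  FROM (
    SELECT col1, concat_ws(":", col2, concat_ws(":", collect_set(col3))) AS col2
    FROM tbl
    GROUP BY col1, col2
  ) t
  GROUP BY col1
""").show(false)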

//collect
data.groupBy("col1","col2").agg(concat_ws(":",collect_set("col3")).alias("col3")).selectExpr("col1","""concat_ws(":",col2,col3) as col2""").groupBy("col1").agg(concat_ws(",",collect_list("col2")).alias("col2")).collect()
//res22: Array[org.apache.spark.sql.Row] = Array([loans,MTG:102:111,CRDS:103,PCL:104:105])

Thanks, this was helpful.