Figuring out the grouping logic for this Spark Scala DataFrame

scala apache-spark

I have a Spark DataFrame DF that looks like this:

----------------------------------------------------------
id | b                      | c  | d          | e         |
----------------------------------------------------------
1  | "ok"                   | 9  | "dontcare" | "dontcare"|
1  | "not ok"               | 10 | "dontcare" | "dontcare"|
1  | "sure"                 | 1  | "dontcare" | "dontcare"|
2  | "not sure"             | 2  | "dontcare" | "dontcare"|
2  | "not so sure"          | 12 | "dontcare" | "dontcare"|
1  | "sure bleh"            | 1  | "dontcare" | "dontcare"|
3  | "not sure"             | 5  | "dontcare" | "dontcare"|
3  | "not so sure"          | 25 | "dontcare" | "dontcare"|
----------------------------------------------------------

I am trying to transform this DF in Spark Scala to create a new DF like the following:

----------------------------------------------------------------------------
id | grouping                                                       | count |
----------------------------------------------------------------------------
1  | (("ok",9),("not ok", 10), ("sure", 1), ("sure bleh", 1))       |   4   |        
2  | (("not sure",2),("not so sure" , 12))                          |   2   |        
3  | (("not sure",5),("not so sure" , 25))                          |   2   |        
----------------------------------------------------------------------------

What is the best way to create this DF with Spark Scala? I got stuck trying to work out the grouping logic. So far I have tried:

import org.apache.spark.sql.functions.{array, collect_list}

val df = spark.read.option("header","true").option("delimiter","\t").csv("test.csv")
df.show

val finalDF = df.groupBy("id").agg(collect_list(array("b", "c")).as("grouping"))

First make the columns an array type, then use groupBy.

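You forgot the count column: add count("*").as("count") to the .agg(...) call. Apart from that there is no problem. One last remark: you can also write collect_list(array("b", "c")).as("grouping") directly inside the aggregation, so you do not have to create a temp column.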

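import org.apache.spark.sql.functions.{array, collect_list, count}

// Read the tab-delimited file; without a schema, every column is read as a string
val df = spark.read.option("header","true").option("delimiter","\t").csv("test.csv")
df.show

// Collect the (b, c) pairs per id and count the rows in each group
val finalDF = df.groupBy("id").agg(collect_list(array("b", "c")).as("grouping"), count("*").as("count")).orderBy("id")
finalDF.show(false)
finalDF.printSchema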

+---+--------------------------------------------------+-----+
|id |grouping                                          |count|
+---+--------------------------------------------------+-----+
|1  |[[ok, 9], [not ok, 10], [sure, 1], [sure bleh, 1]]|4    |
|2  |[[not sure, 2], [not so sure, 12]]                |2    |
|3  |[[not sure, 5], [not so sure, 25]]                |2    |
+---+--------------------------------------------------+-----+

root
 |-- id: string (nullable = true)
 |-- grouping: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- count: long (nullable = false)
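
The schema above shows every element of grouping as a string, because array needs a single element type and the CSV was read without a schema. If c should stay numeric, as in the ("ok", 9) pairs of the desired output, one possible variation is to collect struct values instead of arrays. The following is only a sketch under assumptions: the object name GroupingSketch and the function groupPairs are hypothetical, the sample rows are built in memory rather than read from test.csv, and c is assumed to be an integer column already (for the CSV case that would mean casting it or reading with the inferSchema option). Note also that collect_list does not guarantee the order of elements within each group.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, count, struct}

object GroupingSketch {
  // Hypothetical helper: groups by id, collects (b, c) as structs, and counts rows per id.
  def groupPairs(df: DataFrame): DataFrame =
    df.groupBy("id")
      .agg(
        collect_list(struct(col("b"), col("c"))).as("grouping"),
        count("*").as("count")
      )
      .orderBy("id")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("grouping-sketch")
      .getOrCreate()
    import spark.implicits._

    // In-memory sample mirroring columns id, b, c of the question (d and e omitted).
    val df = Seq(
      (1, "ok", 9), (1, "not ok", 10), (1, "sure", 1),
      (2, "not sure", 2), (2, "not so sure", 12),
      (1, "sure bleh", 1),
      (3, "not sure", 5), (3, "not so sure", 25)
    ).toDF("id", "b", "c")

    val finalDF = groupPairs(df)
    finalDF.show(false)
    // grouping is now an array of struct<b: string, c: int> rather than array<array<string>>
    finalDF.printSchema()

    spark.stop()
  }
}
```

With structs, the individual fields can later be read by name, for example grouping.b or grouping.c, which is not possible with the array-of-arrays form.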