Working out the grouping logic for this Spark Scala DataFrame
I have a Spark DataFrame, DF, that looks like this:
----------------------------------------------------------
id | b | c | d | e |
----------------------------------------------------------
1 | "ok" | 9 | "dontcare" | "dontcare"|
1 | "not ok" | 10 | "dontcare" | "dontcare"|
1 | "sure" | 1 | "dontcare" | "dontcare"|
2 | "not sure" | 2 | "dontcare" | "dontcare"|
2 | "not so sure" | 12 | "dontcare" | "dontcare"|
1 | "sure bleh" | 1 | "dontcare" | "dontcare"|
3 | "not sure" | 5 | "dontcare" | "dontcare"|
3 | "not so sure" | 25 | "dontcare" | "dontcare"|
----------------------------------------------------------
I am trying to transform this DF in Spark Scala into a new DF of the following form:
----------------------------------------------------------------------------
id | grouping | count |
----------------------------------------------------------------------------
1 | (("ok",9),("not ok", 10), ("sure", 1), ("sure bleh", 1)) | 4 |
2 | (("not sure",2),("not so sure" , 12)) | 2 |
3 | (("not sure",5),("not so sure" , 25)) | 2 |
----------------------------------------------------------------------------
What is the best way to create this DF in Spark Scala? I got stuck trying to work out the grouping logic.
So far I have tried:
val df = spark.read.option("header","true").option("delimiter","\t").csv("test.csv")
df.show
val finalDF = df.groupBy("id").agg(collect_list(array("b", "c")).as("grouping"))
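First combine the columns into an array, then use groupBy.
You forgot the count column: .agg(count("*").as("count")). Otherwise it is fine. One last remark: you can also write collect_list(array("b", "c")).as("grouping") directly, so you do not have to create a temp column.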
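import org.apache.spark.sql.functions.{array, collect_list, count}
// Read the tab-delimited file; without schema inference every column is a string
val df = spark.read.option("header","true").option("delimiter","\t").csv("test.csv")
df.show
// Collect the [b, c] pairs per id and count the rows in each group
val finalDF = df.groupBy("id").agg(collect_list(array("b", "c")).as("grouping"), count("*").as("count")).orderBy("id")
finalDF.show(false)
finalDF.printSchema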
+---+--------------------------------------------------+-----+
|id |grouping |count|
+---+--------------------------------------------------+-----+
|1 |[[ok, 9], [not ok, 10], [sure, 1], [sure bleh, 1]]|4 |
|2 |[[not sure, 2], [not so sure, 12]] |2 |
|3 |[[not sure, 5], [not so sure, 25]] |2 |
+---+--------------------------------------------------+-----+
root
|-- id: string (nullable = true)
|-- grouping: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- count: long (nullable = false)
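In the schema above every element of grouping is a string. An array column needs a single element type, and the CSV was read without schema inference, so c is a string as well. If the numeric type of c matters, one option is to collect struct values instead of arrays. The following is a minimal sketch, not part of the original answer: it assumes the sample rows are built inline rather than read from test.csv, leaves out columns d and e, and the names sample and typedDF are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, count, struct}

// Local session for this sketch; in spark-shell a `spark` session already exists.
val spark = SparkSession.builder().master("local[*]").appName("grouping-sketch").getOrCreate()
import spark.implicits._

// Assumption: inline copy of the question's sample rows, with c typed as Int.
val sample = Seq(
  (1, "ok", 9), (1, "not ok", 10), (1, "sure", 1),
  (2, "not sure", 2), (2, "not so sure", 12),
  (1, "sure bleh", 1),
  (3, "not sure", 5), (3, "not so sure", 25)
).toDF("id", "b", "c")

// struct keeps each field's own type, so c stays an integer inside the collected list.
val typedDF = sample
  .groupBy("id")
  .agg(
    collect_list(struct("b", "c")).as("grouping"),
    count("*").as("count")
  )
  .orderBy("id")

typedDF.show(false)
typedDF.printSchema()
```

Note that collect_list does not guarantee the order of elements inside each group, since it depends on the row order after the shuffle. If a stable order is needed, the collected list would have to be sorted explicitly.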