Figuring out the grouping logic of this Spark Scala DataFrame

Tags: scala, apache-spark, apache-spark-sql

I have a Spark DataFrame DF that looks like this:

----------------------------------------------------------
id | b                      | c  | d          | e         |
----------------------------------------------------------
1  | "ok"                   | 9  | "dontcare" | "dontcare"|
1  | "not ok"               | 10 | "dontcare" | "dontcare"|
1  | "sure"                 | 1  | "dontcare" | "dontcare"|
2  | "not sure"             | 2  | "dontcare" | "dontcare"|
2  | "not so sure"          | 12 | "dontcare" | "dontcare"|
1  | "sure bleh"            | 1  | "dontcare" | "dontcare"|
3  | "not sure"             | 5  | "dontcare" | "dontcare"|
3  | "not so sure"          | 25 | "dontcare" | "dontcare"|
----------------------------------------------------------
I am trying to transform this DF in Spark Scala to create a new DF like this:

----------------------------------------------------------------------------
id | grouping                                                       | count |
----------------------------------------------------------------------------
1  | (("ok",9),("not ok", 10), ("sure", 1), ("sure bleh", 1))       |   4   |        
2  | (("not sure",2),("not so sure" , 12))                          |   2   |        
3  | (("not sure",5),("not so sure" , 25))                          |   2   |        
----------------------------------------------------------------------------
What is the best way to create this DF in Spark Scala? I am stuck trying to work out the grouping logic. So far I have tried:

import org.apache.spark.sql.functions.{array, collect_list}

val df = spark.read.option("header","true").option("delimiter","\t").csv("test.csv")
df.show

val finalDF = df.groupBy("id").agg(collect_list(array("b", "c")).as("grouping"))


First turn the columns into a single array-type column, then use groupBy.
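A minimal sketch of that suggestion, assuming the same df as above (the temporary column name "pair" is my own choice, not from the original):

import org.apache.spark.sql.functions.{array, collect_list, count}

// Build the array column first, then group on id; "pair" is a hypothetical temp column
val grouped = df
  .withColumn("pair", array("b", "c"))
  .groupBy("id")
  .agg(collect_list("pair").as("grouping"), count("*").as("count"))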


You forgot the count column: .agg(count("*").as("count")). No problem. One last remark: you can also write collect_list(array("b", "c")).as("grouping") directly, so you don't have to create a temp column.
import org.apache.spark.sql.functions.{array, collect_list, count}

val df = spark.read.option("header","true").option("delimiter","\t").csv("test.csv")
df.show

// Collect each (b, c) pair into an array per id, and count the rows in each group
val finalDF = df.groupBy("id").agg(collect_list(array("b", "c")).as("grouping"), count("*").as("count")).orderBy("id")
finalDF.show(false)
finalDF.printSchema
+---+--------------------------------------------------+-----+
|id |grouping                                          |count|
+---+--------------------------------------------------+-----+
|1  |[[ok, 9], [not ok, 10], [sure, 1], [sure bleh, 1]]|4    |
|2  |[[not sure, 2], [not so sure, 12]]                |2    |
|3  |[[not sure, 5], [not so sure, 25]]                |2    |
+---+--------------------------------------------------+-----+

root
 |-- id: string (nullable = true)
 |-- grouping: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- count: long (nullable = false)
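Note that every field in the result is a string: spark.read.csv reads all columns as strings unless a schema is supplied or inferSchema is enabled, and array() requires one common element type anyway. If you want c to stay numeric, collecting structs instead of arrays preserves per-field types. A minimal sketch under that assumption (the cast is mine, not part of the original answer):

import org.apache.spark.sql.functions.{col, collect_list, count, struct}

// Cast c to int, then collect (b, c) structs so each field keeps its own type
val typedDF = df
  .withColumn("c", col("c").cast("int"))
  .groupBy("id")
  .agg(collect_list(struct("b", "c")).as("grouping"), count("*").as("count"))
  .orderBy("id")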