Grouping by multiple keys in Scala (Arrays, Scala, Apache Spark)


I have a dataframe similar to the following:

val df = sc.parallelize(Seq(
  (100, 1, 1), (100, 1, 2), (100, 2, 3),
  (200, 1, 1), (200, 2, 3), (200, 2, 2), (200, 3, 1), (200, 3, 2),
  (300, 1, 1), (300, 1, 2), (300, 2, 5),
  (400, 1, 6))).toDF("_c0", "_c1", "_c2")

+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
|100|  1|  1|
|100|  1|  2|
|100|  2|  3|
|200|  1|  1|
|200|  2|  3|
|200|  2|  2|
|200|  3|  1|
|200|  3|  2|
|300|  1|  1|
|300|  1|  2|
|300|  2|  5|
|400|  1|  6|
+---+---+---+
I need to group by _c0 and _c1 and get an RDD like the following:

res9: Array[Array[Array[Int]]] = Array(Array(Array(1, 2), Array(3)), Array(Array(1), Array(3, 2), Array(1, 2)), Array(Array(1, 2), Array(5)), Array(Array(6)))

That is an array of arrays of arrays. I am new to Scala; please help.
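For intuition, the desired two-level grouping can be sketched with plain Scala collections, without Spark. This is only an illustration of the grouping semantics; the sort calls make the group order deterministic, which Spark's `groupBy` does not guarantee:

```scala
// Plain-Scala sketch of the two-level grouping (no Spark).
// Rows are (c0, c1, c2); group by c0, then within each c0 group by c1,
// keeping only the c2 values. Sorting the keys makes output deterministic.
val rows = Seq(
  (100, 1, 1), (100, 1, 2), (100, 2, 3),
  (200, 1, 1), (200, 2, 3), (200, 2, 2), (200, 3, 1), (200, 3, 2),
  (300, 1, 1), (300, 1, 2), (300, 2, 5),
  (400, 1, 6))

val grouped: Array[Array[Array[Int]]] =
  rows.groupBy(_._1).toSeq.sortBy(_._1).map { case (_, byC0) =>
    byC0.groupBy(_._2).toSeq.sortBy(_._1)
      .map { case (_, byC1) => byC1.map(_._3).toArray }
      .toArray
  }.toArray

// grouped(0) corresponds to _c0 == 100: Array(Array(1, 2), Array(3))
```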

You can `groupBy` `_c0` and `_c1` first, and then `groupBy` `_c0` again to get the desired result. Here is the code for the same:

import org.apache.spark.sql.functions.collect_list

//first group by "_c0" and "_c1", collecting "_c2" into a list
val res = df.groupBy("_c0", "_c1").agg(collect_list("_c2").as("_c2"))
  //then group by "_c0" alone, collecting the lists themselves
  .groupBy("_c0").agg(collect_list("_c2").as("_c2"))
  .select("_c2")

res.show(false)

//output
//+---------------------------------------------------------+
//|_c2                                                      |
//+---------------------------------------------------------+
//|[WrappedArray(1, 2), WrappedArray(5)]                    |
//|[WrappedArray(1, 2), WrappedArray(3)]                    |
//|[WrappedArray(6)]                                        |
//|[WrappedArray(3, 2), WrappedArray(1, 2), WrappedArray(1)]|
//+---------------------------------------------------------+
To convert this to an `RDD`, call `.rdd` on the resulting dataframe:

import scala.collection.mutable.WrappedArray

//each row holds a WrappedArray of WrappedArrays; unwrap to nested Arrays
val rdd = res.rdd.map(x => x.get(0)
  .asInstanceOf[WrappedArray[WrappedArray[Int]]].array.map(_.toArray))

//to collect the contents of the rdd (don't use this if the data is too big)
rdd.collect()
//output
//Array(Array(Array(1, 2), Array(5)), Array(Array(1, 2), Array(3)), Array(Array(6)), Array(Array(3, 2), Array(1, 2), Array(1)))
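As a possible alternative to the `asInstanceOf` cast, the single-column dataframe can be read as a typed `Dataset`. This is a sketch, assuming a `SparkSession` named `spark` is in scope to provide the implicit encoders:

```scala
// Sketch: avoid the WrappedArray cast by going through a typed Dataset.
// Assumes a SparkSession named `spark` is in scope for the implicits.
import spark.implicits._

val rdd2 = res.as[Seq[Seq[Int]]]   // Dataset[Seq[Seq[Int]]]
  .rdd
  .map(_.map(_.toArray).toArray)   // RDD[Array[Array[Int]]]
```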

Can you explain your output in more detail? I need to group by _c0 and _c1. Say, for the first key, 100: if we then do a groupBy on _c1, I expect an array such as [[1, 2], [3]] for _c0 == 100. How did you generate this output? `res9: Array[Array[Array[Int]]] = Array(Array(Array(1, 2), Array(3)), Array(Array(1), Array(3, 2), Array(1, 2)), Array(Array(1, 2), Array(5)), Array(Array(6)))`

So, like your [[1, 2], [3]] example: suppose _c0 is 200. Then _c1 takes the values 1, 2 and 3, that is, three groups: [[1], [3, 2], [1, 2]]. Still confused?
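To make the _c0 == 200 case concrete, here is a minimal sketch with plain Scala collections (no Spark) of how the three _c1 groups arise; sorting the keys is only to fix the display order:

```scala
// Rows with _c0 == 200, written as (c1, c2) pairs.
val rows200 = Seq((1, 1), (2, 3), (2, 2), (3, 1), (3, 2))

// Group by c1 (sorted for a deterministic order) and keep the c2 values.
val groups = rows200.groupBy(_._1).toSeq.sortBy(_._1)
  .map { case (_, g) => g.map(_._2) }
// groups == Seq(Seq(1), Seq(3, 2), Seq(1, 2))
```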