
Scala: how to group data by column and count the number of observations in each group


I have a dataframe df with three columns: id, type, and activity.

val myData = (Seq(("aa1", "GROUP_A", "10"),("aa1","GROUP_A", "12"),("aa2","GROUP_A", "hy"),("aa2", "GROUP_B", "14"),
              ("aa3","GROUP_B", "11"),("aa3","GROUP_B","12" ),("aa2", "GROUP_3", "12"))

val df = sc.parallelize(myData).toDF()
I need to group the data by type and then count the number of activities for each id. This is the expected result:

type      id    count
GROUP_A   aa1   2
GROUP_A   aa2   1
GROUP_B   aa3   3
GROUP_B   aa2   1
This is what I tried:

df.groupBy("type","id").count().sort("count").show()

However, it does not give the correct result.

I made minimal changes to your sample data, and it works for me:

//yours
val myData = (Seq(("aa1", "GROUP_A", "10"),("aa1","GROUP_A", "12"),("aa2","GROUP_A", "hy"),("aa2", "GROUP_B", "14"),("aa3","GROUP_B", "11"),("aa3","GROUP_B","12" ),("aa2", "GROUP_3", "12"))

//mine 
//removed the ( at the beginning
//changed GROUP_3 to GROUP_B
//other minor changes so that the resultant group by will look like you desired
val myData = Seq(("aa1", "GROUP_A", "10"),("aa1","GROUP_A", "12"),("aa2","GROUP_A", "12"),("aa3", "GROUP_B", "14"),("aa3","GROUP_B", "11"),("aa3","GROUP_B","12" ),("aa2", "GROUP_B", "12"))


//yours
val df = sc.parallelize(myData).toDF()
//mine
//added in column names

val df = sc.parallelize(myData).toDF("id","type","count")

df.groupBy("type","id").count.show
+-------+---+-----+
|   type| id|count|
+-------+---+-----+
|GROUP_A|aa1|    2|
|GROUP_A|aa2|    1|
|GROUP_B|aa2|    1|
|GROUP_B|aa3|    3|
+-------+---+-----+
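If you also want the largest groups first, you can sort the grouped counts in descending order. A minimal sketch, assuming the same df as above and Spark's built-in desc function:

```scala
import org.apache.spark.sql.functions.desc

val counts = df.groupBy("type", "id")
  .count()                 // adds a `count` column per (type, id) group
  .orderBy(desc("count"))  // largest groups first
counts.show()
```

Note that sort("count") in the question sorts ascending, which is why the biggest group appeared last.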

Am I missing anything?

You can define the column names when creating the dataframe, and then count on the grouped data. It should be straightforward:

import sqlContext.implicits._

val myData = Seq(("aa1", "GROUP_A", "10"),
  ("aa1","GROUP_A", "12"),
  ("aa2","GROUP_A", "hy"),
  ("aa2", "GROUP_B", "14"),
  ("aa3","GROUP_B", "11"),
  ("aa3","GROUP_B","12" ),
  ("aa3", "GROUP_B", "12"))

val df = sc.parallelize(myData).toDF("id", "type", "activity")
df.groupBy("type","id").count().sort("count").show()
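For intuition, the same grouping semantics can be sketched with plain Scala collections, no Spark needed (the data here is a hypothetical copy of the sample rows): group rows by the (type, id) key and count the rows in each group.

```scala
// Plain-Scala sketch of what groupBy + count computes.
val rows = Seq(("aa1", "GROUP_A", "10"), ("aa1", "GROUP_A", "12"),
               ("aa2", "GROUP_A", "hy"), ("aa2", "GROUP_B", "14"),
               ("aa3", "GROUP_B", "11"), ("aa3", "GROUP_B", "12"))

// Key each row by (type, id), then count the rows per key.
val counts: Map[(String, String), Int] =
  rows.groupBy { case (id, tpe, _) => (tpe, id) }
      .map { case (key, group) => key -> group.size }

// counts(("GROUP_A", "aa1")) == 2
```

Spark's groupBy does the same thing, but distributed across partitions with a shuffle on the grouping key.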

Thank you very much. It should be toDF("id", "type", "count"), since the aa.. values are ids. Let me check.