Scala Spark SQL: average of a column after groupBy


I have an RDD of student grades. I first need to group the rows by the first column (university) and then show the average number of students per course at each university, like below. What is the simplest way to express this query?

+----------+----------------+
|university|avg of students |
+----------+----------------+
|       MIT|             3.0|
| Cambridge|            2.66|
+----------+----------------+
Here is the dataset:

case class grade(university: String, courseId: Int, studentId: Int, grade: Double)

val grades = List(
grade("Cambridge", 1, 1001, 4),
grade("Cambridge", 1, 1004, 4),
grade("Cambridge", 2, 1006, 3.5),
grade("Cambridge", 2, 1004, 3.5),
grade("Cambridge", 2, 1002, 3.5),
grade("Cambridge", 3, 1006, 3.5),
grade("Cambridge", 3, 1007, 5),
grade("Cambridge", 3, 1008, 4.5),
grade("MIT", 1, 1001, 4),
grade("MIT", 1, 1002, 4),
grade("MIT", 1, 1003, 4),
grade("MIT", 1, 1004, 4),
grade("MIT", 1, 1005, 3.5),
grade("MIT", 2, 1009, 2))
1) First, groupBy university

2) Then get the number of courses at each university

3) Then groupBy courseId

4) Then count the students in each course

// Total enrolments across courses divided by the number of distinct courses.
grades.groupBy(_.university).map { case (k, v) =>
    val courseCount = v.map(_.courseId).distinct.length
    val studentCountPerCourse = v.groupBy(_.courseId).map { case (k, v) => v.length }.sum
    k -> (studentCountPerCourse.toDouble / courseCount.toDouble)
  }
Scala REPL

scala> val grades = List(
      grade("Cambridge", 1, 1001, 4),
      grade("Cambridge", 1, 1004, 4),
      grade("Cambridge", 2, 1006, 3.5),
      grade("Cambridge", 2, 1004, 3.5),
      grade("Cambridge", 2, 1002, 3.5),
      grade("Cambridge", 3, 1006, 3.5),
      grade("Cambridge", 3, 1007, 5),
      grade("Cambridge", 3, 1008, 4.5),
      grade("MIT", 1, 1001, 4),
      grade("MIT", 1, 1002, 4),
      grade("MIT", 1, 1003, 4),
      grade("MIT", 1, 1004, 4),
      grade("MIT", 1, 1005, 3.5),
      grade("MIT", 2, 1009, 2))
// grades: List[grade] = List(...)

scala> grades.groupBy(_.university).map { case (k, v) =>
      val courseCount = v.map(_.courseId).distinct.length
      val studentCountPerCourse = v.groupBy(_.courseId).map { case (k, v) => v.length }.sum
      k -> (studentCountPerCourse.toDouble / courseCount.toDouble)
    }
// res2: Map[String, Double] = Map("MIT" -> 3.0, "Cambridge" -> 2.6666666666666665)
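Summing the per-course group sizes just gives back the total number of rows for that university, so the inner groupBy in the answer can be dropped. A simplified sketch in plain Scala (same data as above):

```scala
case class grade(university: String, courseId: Int, studentId: Int, grade: Double)

val grades = List(
  grade("Cambridge", 1, 1001, 4),
  grade("Cambridge", 1, 1004, 4),
  grade("Cambridge", 2, 1006, 3.5),
  grade("Cambridge", 2, 1004, 3.5),
  grade("Cambridge", 2, 1002, 3.5),
  grade("Cambridge", 3, 1006, 3.5),
  grade("Cambridge", 3, 1007, 5),
  grade("Cambridge", 3, 1008, 4.5),
  grade("MIT", 1, 1001, 4),
  grade("MIT", 1, 1002, 4),
  grade("MIT", 1, 1003, 4),
  grade("MIT", 1, 1004, 4),
  grade("MIT", 1, 1005, 3.5),
  grade("MIT", 2, 1009, 2))

// Row count per university divided by its distinct course count --
// equivalent to averaging the per-course student counts.
val avgStudents: Map[String, Double] = grades.groupBy(_.university).map {
  case (uni, rows) => uni -> rows.size.toDouble / rows.map(_.courseId).distinct.size
}
// avgStudents: Map("MIT" -> 3.0, "Cambridge" -> 2.6666666666666665)
```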

What do you mean by "number of students"? Do you mean the student count? — Yes: the average number of students per course at each university. For Cambridge that is (2+3+3)/3, for MIT it is (5+1)/2. — @sina you can also thank me by accepting the answer :) — This solution is not written in Spark. — @eliasah it is just to give an idea of how to get the answer; I think it can be translated into the Spark domain to make it work.
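As the comments note, the accepted snippet is plain Scala rather than Spark. The same two-level aggregation can be sketched with the DataFrame API — first count students per (university, course), then average those counts per university. This is an illustrative sketch, not the accepted answer's code; the app name, column aliases, and local master setting are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count}

object AvgStudents {
  // Same schema as the question's case class.
  case class Grade(university: String, courseId: Int, studentId: Int, grade: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("avg-students").getOrCreate()
    import spark.implicits._

    val grades = Seq(
      Grade("Cambridge", 1, 1001, 4),
      Grade("Cambridge", 1, 1004, 4),
      Grade("Cambridge", 2, 1006, 3.5),
      Grade("Cambridge", 2, 1004, 3.5),
      Grade("Cambridge", 2, 1002, 3.5),
      Grade("Cambridge", 3, 1006, 3.5),
      Grade("Cambridge", 3, 1007, 5),
      Grade("Cambridge", 3, 1008, 4.5),
      Grade("MIT", 1, 1001, 4),
      Grade("MIT", 1, 1002, 4),
      Grade("MIT", 1, 1003, 4),
      Grade("MIT", 1, 1004, 4),
      Grade("MIT", 1, 1005, 3.5),
      Grade("MIT", 2, 1009, 2)).toDS()

    // Step 1: students per (university, course); step 2: average those counts.
    val result = grades
      .groupBy($"university", $"courseId")
      .agg(count($"studentId").as("students"))
      .groupBy($"university")
      .agg(avg($"students").as("avg_of_students"))

    result.show()
    spark.stop()
  }
}
```

Running it locally should produce one row per university with the averages computed above (3.0 for MIT, 2.67 for Cambridge).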