Apache Spark: aggregating multiple columns (possibly into arrays) from a join output

Tags: apache-spark, apache-spark-sql

I have the following datasets:

Table 1

Table 2

Now I want to join the two datasets. I have tried a left outer join on Table1.id == Table2.departmentid, but I am not getting the desired output.

Later, I need to use this table to derive a few counts and transform the data into XML. I will use a map for that transformation.


Any help would be appreciated.

A join alone is not enough to get the desired output. You may be missing something: the last element of each nested array is probably the departmentid. Assuming the last element of each nested array is the departmentid, I generated the output as follows:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val df = spark.sparkContext.parallelize(Seq(
   (1,"Physics"),
   (2,"Computer"),
   (3,"Maths")
 )).toDF("ID","Dept")

 val schema = List(
    StructField("EMPID", IntegerType, true),
    StructField("EMPNAME", StringType, true),
    StructField("DeptID", IntegerType, true)
  )

  val data = Seq(
    Row(1,"A",1),
    Row(2,"B",1),
    Row(3,"C",2),
    Row(4,"D",2) ,
    Row(5,"E",null)
  )

  val df_emp = spark.createDataFrame(
    spark.sparkContext.parallelize(data),
    StructType(schema)
  )

  val newdf = df_emp
    .withColumn("CONC", array($"EMPID", $"EMPNAME", $"DeptID"))
    .groupBy($"DeptID")
    .agg(collect_list($"CONC").as("emplist"))

  df.join(newdf, df("ID") === newdf("DeptID")).select($"ID", $"Dept", $"emplist").show()

+---+--------+--------------------+
| ID|    Dept|             emplist|
+---+--------+--------------------+
|  1| Physics|[[1, A, 1], [2, B...|
|  2|Computer|[[3, C, 2], [4, D...|
+---+--------+--------------------+

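As a side note (not part of the original answer): `array($"EMPID", $"EMPNAME", $"DeptID")` mixes integer and string columns, so Spark coerces all elements to a common string type. If the original types need to be preserved, collecting a struct instead of an array may work; a minimal sketch, assuming the `df` and `df_emp` DataFrames defined above are in scope:

```scala
import org.apache.spark.sql.functions.{collect_list, struct}

// Collect typed (EMPID, EMPNAME) structs per department instead of string arrays
val newdfStruct = df_emp
  .groupBy($"DeptID")
  .agg(collect_list(struct($"EMPID", $"EMPNAME")).as("emplist"))

df.join(newdfStruct, df("ID") === newdfStruct("DeptID"))
  .select($"ID", $"Dept", $"emplist")
  .show(false)
```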
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.collect_list

case class department(id: Integer, deptname: String)
case class employee(employeid:Integer, empname:String, departmentid:Integer)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val department_df = Seq(department(1, "physics")
                            ,department(2, "computer") ).toDF()
val emplyoee_df = Seq(employee(1, "A", 1)
                      ,employee(2, "B", 1)
                      ,employee(3, "C", 2)
                      ,employee(4, "D", 2)).toDF()

val result = department_df.join(emplyoee_df, department_df("id") === emplyoee_df("departmentid"), "left").
      selectExpr("id", "deptname", "employeid", "empname").
      rdd.map {
        case Row(id:Integer, deptname:String, employeid:Integer, empname:String) => (id, deptname, Array(employeid.toString, empname, id.toString))
      }.toDF("id", "deptname", "arrayemp").
          groupBy("id", "deptname").
          agg(collect_list("arrayemp").as("emplist")).
        orderBy("id", "deptname")
The output looks like this:

result.show(false)
+---+--------+----------------------+
|id |deptname|emplist               |
+---+--------+----------------------+
|1  |physics |[[2, B, 1], [1, A, 1]]|
|2  |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
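The RDD round-trip in the code above is not strictly necessary: `array` and `collect_list` are both available in the DataFrame API. A sketch producing the same result, assuming the `department_df` and `emplyoee_df` DataFrames from above (Spark casts the integer columns to string inside `array`):

```scala
import org.apache.spark.sql.functions.{array, col, collect_list}

val result2 = department_df
  .join(emplyoee_df, department_df("id") === emplyoee_df("departmentid"), "left")
  // build the [employeid, empname, departmentid] array directly, no rdd.map needed
  .withColumn("arrayemp", array(col("employeid"), col("empname"), col("departmentid")))
  .groupBy(department_df("id"), col("deptname"))
  .agg(collect_list(col("arrayemp")).as("emplist"))
  .orderBy("id", "deptname")
```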
Explanation: if I break the last DataFrame transformation into multiple steps, it may become clear how the output is generated.

Left outer join between department and employee:

val df1 = department_df.join(emplyoee_df, department_df("id") === emplyoee_df("departmentid"), "left").
      selectExpr("id", "deptname", "employeid", "empname")
df1.show()
+---+--------+---------+-------+
| id|deptname|employeid|empname|
+---+--------+---------+-------+
|  1| physics|        2|      B|
|  1| physics|        1|      A|
|  2|computer|        4|      D|
|  2|computer|        3|      C|
+---+--------+---------+-------+
Create an array from the values of some of the columns of the df1 DataFrame:

val df2 = df1.rdd.map {
                case Row(id:Integer, deptname:String, employeid:Integer, empname:String) => (id, deptname, Array(employeid.toString, empname, id.toString))
              }.toDF("id", "deptname", "arrayemp")
df2.show()
+---+--------+---------+
| id|deptname| arrayemp|
+---+--------+---------+
|  1| physics|[2, B, 1]|
|  1| physics|[1, A, 1]|
|  2|computer|[4, D, 2]|
|  2|computer|[3, C, 2]|
+---+--------+---------+
Aggregate the multiple arrays of the df2 DataFrame into a new list:

val result = df2.groupBy("id", "deptname").
              agg(collect_list("arrayemp").as("emplist")).
              orderBy("id", "deptname")
result.show(false)
+---+--------+----------------------+
|id |deptname|emplist               |
+---+--------+----------------------+
|1  |physics |[[2, B, 1], [1, A, 1]]|
|2  |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
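Since the question mentions deriving counts later: once `emplist` exists, a per-department employee count follows directly from the array length. A sketch, assuming the `result` DataFrame from the steps above:

```scala
import org.apache.spark.sql.functions.size

// size() returns the number of elements in the collected array column
result.withColumn("empcount", size($"emplist")).show(false)
```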