Spark SQL query to aggregate and update values in a parent-child relationship table

Tags: sql, scala, apache-spark, apache-spark-sql

I have the following tables (each table also has other columns, but they are not needed here).

Item table

      user_id, party_id, parent_user_id
        1,        X,      null
        2,        X,       1
        3,        X,       1
        4,        Z,       null
        5,        Z,       4
        6,        Y,       null
        7,        Y,       6
        8,        Y,       6
        1,        Y,       null
A null parent_user_id means the row is a parent user; a row with a non-null parent_user_id is a child of the user it points to.

Priority table

      user_id, party_id, priority
          1,      X,     0.3 
          2,      X,     0.8
          3,      X,     0.5 
          4,      Z,     0.1
          5,      Z,     0.2
          6,      Y,     0.7
          7,      Y,     0.4
          8,      Y,     0.5
          1,      Y,     0.3
What I want to do is write a Spark query that transforms the priority table as shown below. The logic: for each parent user and its child users, compute the maximum of their priorities, i.e. max(parent's priority, child 1's priority, child 2's priority, ...), and then set the priority of that parent and all of its children to this maximum (as reflected in the updated table below). For example, for party X the parent is user 1 with children 2 and 3, so max(0.3, 0.8, 0.5) = 0.8 and all three rows become 0.8.

Updated priority table

      user_id, party_id, priority
          1,      X,     0.8 
          2,      X,     0.8
          3,      X,     0.8 
          4,      Z,     0.2
          5,      Z,     0.2
          6,      Y,     0.7
          7,      Y,     0.7
          8,      Y,     0.7
          1,      Y,     0.3

How would you write a Spark job to accomplish this? Even a basic SQL query that does the job would be a good starting point.
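For reference, here is a minimal sketch of what a plain-SQL version of this logic could look like, assuming the two tables have been registered as temporary views named item and priority (those view names, and the use of spark.sql on a SparkSession called spark, are my own assumptions). It coalesces a null parent_user_id to the row's own user_id and takes the maximum priority over each resulting group:

import org.apache.spark.sql.DataFrame

// Assumes an active SparkSession `spark` and that the two DataFrames were
// registered via createOrReplaceTempView("item") / createOrReplaceTempView("priority").
val updated: DataFrame = spark.sql("""
  SELECT i.user_id,
         i.party_id,
         MAX(p.priority) OVER (
           PARTITION BY i.party_id,
                        COALESCE(i.parent_user_id, i.user_id)
         ) AS priority
  FROM item i
  JOIN priority p
    ON  i.user_id  = p.user_id
    AND i.party_id = p.party_id
""")
updated.show(false)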

My solution is based on the following steps (the full code is shown after the list):

1) Join the Item table with the Priority table on user_id and party_id
2) Replace a null parent_user_id with the row's own user_id (this is where the magic happens)
3) Drop the unnecessary columns if needed
4) Apply a window function and take the maximum priority within each party_id / parent_user_id partition

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, when}
import spark.implicits._

val itemTbl = Seq((1, "X", None),
  (2, "X", Some(1)),
  (3, "X", Some(1)),
  (4, "Z", None),
  (5, "Z", Some(4)),
  (6, "Y", None),
  (7, "Y", Some(6)),
  (8, "Y", Some(6)),
  (1, "Y", None)).toDF("user_id", "party_id", "parent_user_id")

val priorityTbl = Seq((1, "X", 0.3),
  (2, "X", 0.8),
  (3, "X", 0.5),
  (4, "Z", 0.1),
  (5, "Z", 0.2),
  (6, "Y", 0.7),
  (7, "Y", 0.4),
  (8, "Y", 0.5),
  (1, "Y", 0.3)).toDF("user_id", "party_id", "priority")

// Replace a null parent_user_id with the row's own user_id, so every
// parent/child group shares the same parent_user_id value.
val replaceExp = when(col("parent_user_id").isNull, col("user_id")).otherwise(col("parent_user_id"))
val itemTblModf = itemTbl.withColumn("parent_user_id", replaceExp)

// Each (party_id, parent_user_id) partition is one parent plus its children.
val windowSpec = Window.partitionBy("party_id", "parent_user_id")

itemTblModf.join(priorityTbl, itemTblModf("user_id") <=> priorityTbl("user_id") && itemTblModf("party_id") <=> priorityTbl("party_id"))
  .drop(priorityTbl("user_id"))
  .drop(priorityTbl("party_id"))
  .withColumn("new_priority", max("priority") over windowSpec)
  .show(200, false)
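If the output should have exactly the same shape as the updated priority table above, one small follow-up step (just a sketch; it assumes the joined/windowed DataFrame above is bound to a val named result instead of ending in show) is to keep only the relevant columns and rename new_priority back to priority:

// Hypothetical: `result` is the DataFrame produced by the join + window above,
// i.e. the same chain without the trailing .show(200, false).
val updatedPriority = result
  .select(col("user_id"), col("party_id"), col("new_priority").as("priority"))

updatedPriority.show(200, false)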

I would not say my solution is very efficient, but it is a working one. The same result could also be achieved in a more efficient way using RDDs.
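For comparison, here is a sketch of an aggregate-then-join variant that avoids the window function (this is just one possible alternative under the same assumptions as the code above, not the RDD approach mentioned here): compute the maximum once per group with groupBy and join it back to every member of the group.

import org.apache.spark.sql.functions.max

// Join the item table (nulls already replaced) with the priority table on the shared keys.
val joined = itemTblModf.join(priorityTbl, Seq("user_id", "party_id"))

// One row per (party_id, parent_user_id) group, carrying that group's maximum priority.
val groupMax = joined
  .groupBy("party_id", "parent_user_id")
  .agg(max("priority").as("new_priority"))

// Attach the group maximum back to each parent and child row.
val updatedPriorityAgg = joined
  .join(groupMax, Seq("party_id", "parent_user_id"))
  .select("user_id", "party_id", "new_priority")

updatedPriorityAgg.show(200, false)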

This looks like a workable solution. The only issue is that user_id 1 with party_id Y should be 0.3, not 0.7: if you look at the Item table, user_id 1 / party_id Y is a parent user, so it should be 0.3. I am looking into that. Interesting scenario. Perfect!! Thanks.