Scala Spark: Handling Data Skew in Joins with Salting

I would like to write a function that handles data skew when joining two Spark Datasets.

For DataFrames the solution is straightforward:

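import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{explode, rand, typedLit}

def saltedJoin(left: DataFrame, right: DataFrame, e: Column, kind: String = "inner", replicas: Int): DataFrame = {
    val spark = left.sparkSession
    import spark.implicits._ // brings the $"..." column syntax into scope

    // Replicate every left row once per salt value in 0 until replicas.
    val saltedLeft = left.
      withColumn("__temporarily__", typedLit((0 until replicas).toArray)).
      withColumn("__skew_left__", explode($"__temporarily__")).
      drop($"__temporarily__").
      repartition($"__skew_left__")

    // Give every right row a single random salt in [0, replicas).
    val saltedRight = right.
      withColumn("__temporarily__", rand()).
      withColumn("__skew_right__", ($"__temporarily__" * replicas).cast("bigint")).
      drop("__temporarily__").
      repartition($"__skew_right__")

    // Join on equal salts plus the caller's condition, then remove the salt columns.
    saltedLeft.
      join(saltedRight, $"__skew_left__" === $"__skew_right__" && e, kind).
      drop($"__skew_left__").
      drop($"__skew_right__")
  }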

You call the function like this:

val joined = saltedJoin(df alias "l", df alias "r", $"l.x" === $"r.x", replicas = 5)
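To make the effect of the salt concrete, here is a minimal sketch in plain Scala collections (no Spark involved). It assumes a made-up key column in which 1000 rows share one hot key, and it compares the largest group of rows that must be processed together when grouping by key alone versus by (key, random salt). The object name, the data, and the fixed seed are illustrative assumptions, not part of the function above.

```scala
import scala.util.Random

object SaltBalanceSketch {
  val replicas = 5               // same role as `replicas` in saltedJoin
  val rng      = new Random(42)  // fixed seed so that repeated runs agree

  // Hypothetical skewed key column: 1000 rows share "hot", 10 rows have unique keys.
  val keys: Seq[String] = Seq.fill(1000)("hot") ++ (1 to 10).map(i => s"k$i")

  // Largest group when rows are grouped by key alone: the whole hot key.
  val unsaltedMax: Int = keys.groupBy(identity).values.map(_.size).max

  // Largest group when rows are grouped by (key, random salt):
  // the hot key is split into at most `replicas` smaller groups.
  val salted: Seq[(String, Int)] = keys.map(k => (k, rng.nextInt(replicas)))
  val saltedMax: Int = salted.groupBy(identity).values.map(_.size).max
}
```

Without the salt the largest group is the full 1000 hot rows. With the salt it can never be smaller than 1000 / replicas, and with a reasonably uniform random generator it should land near that value, which is the reason the side that receives the random salt is the one that benefits.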

However, I don't know how to write the join function for Dataset instances. So far I have written the following:

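import org.apache.spark.sql.{Column, Dataset, Encoder}
import scala.reflect.runtime.universe.TypeTag
import scala.util.Random

def saltedJoinWith[A: Encoder : TypeTag, B: Encoder : TypeTag](left: Dataset[A],
                                             right: Dataset[B],
                                             e: Column,
                                             kind: String = "inner",
                                             replicas: Int): Dataset[(A, B)] = {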
    val spark = left.sparkSession
    val random = new Random()
    import spark.implicits._

    val saltedLeft: Dataset[(A, Int)] = left flatMap (a => 0 until replicas map ((a, _)))
    val saltedRight: Dataset[(B, Int)] = right map ((_, random.nextInt(replicas)))

    saltedLeft.joinWith(saltedRight, saltedLeft("_2") === saltedRight("_2") && e, kind).map(x => (x._1._1, x._2._1))
  }

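This is obviously not the correct solution, because the join condition e does not refer to columns defined in saltedRight and saltedLeft. It refers to columns nested inside saltedRight._1 and saltedLeft._1. So, for example,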

val j = saltedJoinWith(ds alias "l", ds alias "r", $"l.x" === $"r.x", replicas = 5)

fails at runtime with the following exception:

org.apache.spark.sql.AnalysisException: cannot resolve '`l.x`' given input columns: [_1, _2];;

I am using Apache Spark 2.2.

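What if you convert the Datasets to DataFrames inside the function and then follow the usual steps?

I considered that. How would I convert the resulting DataFrame into a Dataset of tuples?

You can define a case class beforehand and apply it to the resulting DataFrame, as in DF.as[class].

@Ashkan. How does the condition join(..., $"skew_left" === $"skew_right") work? The left table has 1, 2, 3, 4 and 5, while skew_right has some random number * (1, 2, 3, 4, 5, 6) ... how do they match in the join condition?

@LearnHadoop. The left table will have 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5, and the right table will have 1, 2, 3, 4, 5. We guarantee that at least one row in the right table matches the left table, so it works as expected.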
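The matching argument in the last comment can be checked with a small model of saltedJoin written against plain Scala collections instead of Spark. This is only a sketch under stated assumptions: the Rec case class, the sample rows, and the helper names saltedJoinModel and plainJoin are hypothetical, and the original join condition is fixed to equality on x. One side gets a copy of every row for every salt value and the other side gets exactly one random salt per row, so each pair that satisfies the original condition meets on exactly one salt value.

```scala
import scala.util.Random

object SaltedJoinSketch {
  final case class Rec(x: Int, tag: String) // hypothetical row type

  // Collections model of saltedJoin; the join condition is fixed to l.x == r.x.
  def saltedJoinModel(left: Seq[Rec], right: Seq[Rec],
                      replicas: Int, rng: Random): Seq[(Rec, Rec)] = {
    // Left side: one copy of every row per salt value in 0 until replicas.
    val saltedLeft = for (l <- left; s <- 0 until replicas) yield (l, s)
    // Right side: exactly one random salt per row.
    val saltedRight = right.map(r => (r, rng.nextInt(replicas)))
    // Keep the pairs whose salts agree AND that satisfy the original condition.
    for {
      (l, ls) <- saltedLeft
      (r, rs) <- saltedRight
      if ls == rs && l.x == r.x
    } yield (l, r)
  }

  // Reference result: the same join without any salting.
  def plainJoin(left: Seq[Rec], right: Seq[Rec]): Seq[(Rec, Rec)] =
    for (l <- left; r <- right if l.x == r.x) yield (l, r)

  val left: Seq[Rec]  = Seq(Rec(1, "a"), Rec(1, "b"), Rec(2, "c"))
  val right: Seq[Rec] = Seq(Rec(1, "p"), Rec(1, "q"), Rec(1, "r"), Rec(3, "s"))

  val salted: Seq[(Rec, Rec)] = saltedJoinModel(left, right, replicas = 5, rng = new Random(0))
  val plain: Seq[(Rec, Rec)]  = plainJoin(left, right)
}
```

Whatever salts the random generator picks, salted and plain contain the same pairs with no duplicates, because every right row carries a single salt and every left row has exactly one copy with that salt.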