Apache Spark: which join does Spark choose when not all of the selection criteria are met?


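We know that Spark has three types of joins: broadcast join, shuffle join, and sort-merge join:

- When a small table is joined to a large table, a broadcast join is used.
- When the small table is larger than the broadcast threshold, a shuffle join is used.
- When large tables are joined and the join keys are sortable, a sort-merge join is used.

What happens when two large tables are joined and the join keys cannot be sorted? Which join type will Spark choose?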

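Spark 3.0 and later support the following join strategies:

- Broadcast hash join (BHJ)
- Shuffle hash join
- Shuffle sort merge join (SMJ)
- Broadcast nested loop join (BNLJ)
- Cartesian product join

How Spark chooses among them is best described in the Spark source code, in the comment on JoinSelection quoted below: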


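As the source comment shows, the outcome of the selection depends not only on the table sizes and on whether the join keys are sortable, but also on the join type (inner, left/right, full) and on the join condition (equi, non-equi/theta). Overall, in your case you will probably see either a shuffle hash join or a broadcast nested loop join.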
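As a rough illustration of the size-based part of that decision, here is a minimal, Spark-free sketch. The object JoinSizeChecks is hypothetical. The function names and the two heuristics (a side fits a local hash map if it is smaller than the broadcast threshold times the number of shuffle partitions; "much smaller" means at least three times smaller) follow my recollection of the Spark 3.0 planner and should be treated as assumptions. The same goes for the defaults used for spark.sql.autoBroadcastJoinThreshold (10 MB) and spark.sql.shuffle.partitions (200).

```scala
// Simplified model of the size checks; this is NOT Spark's own code.
object JoinSizeChecks {
  // Assumed default of spark.sql.autoBroadcastJoinThreshold: 10 MB.
  val DefaultBroadcastThreshold: Long = 10L * 1024 * 1024
  // Assumed default of spark.sql.shuffle.partitions.
  val DefaultShufflePartitions: Int = 200

  /** A side can be broadcast when its estimated size is within the threshold. */
  def canBroadcast(sizeInBytes: Long,
                   threshold: Long = DefaultBroadcastThreshold): Boolean =
    sizeInBytes >= 0 && sizeInBytes <= threshold

  /** Assumed heuristic: each shuffle partition of this side should fit in memory. */
  def canBuildLocalHashMap(sizeInBytes: Long,
                           threshold: Long = DefaultBroadcastThreshold,
                           partitions: Int = DefaultShufflePartitions): Boolean =
    sizeInBytes < threshold * partitions

  /** Assumed heuristic: side `a` is at least three times smaller than side `b`. */
  def muchSmaller(a: Long, b: Long): Boolean =
    a * 3 <= b
}
```

With these defaults, a 5 MB side passes canBroadcast, while a 50 MB side does not but still passes canBuildLocalHashMap. That second case is the kind of "small table above the threshold" situation the question describes.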

Please elaborate on "the join keys cannot be sorted"… code…
  /**
   * Select the proper physical plan for join based on join strategy hints, the availability of
   * equi-join keys and the sizes of joining relations. Below are the existing join strategies,
   * their characteristics and their limitations.
   *
   * - Broadcast hash join (BHJ):
   *     Only supported for equi-joins, while the join keys do not need to be sortable.
   *     Supported for all join types except full outer joins.
   *     BHJ usually performs faster than the other join algorithms when the broadcast side is
   *     small. However, broadcasting tables is a network-intensive operation and it could cause
   *     OOM or perform badly in some cases, especially when the build/broadcast side is big.
   *
   * - Shuffle hash join:
   *     Only supported for equi-joins, while the join keys do not need to be sortable.
   *     Supported for all join types except full outer joins.
   *
   * - Shuffle sort merge join (SMJ):
   *     Only supported for equi-joins and the join keys have to be sortable.
   *     Supported for all join types.
   *
   * - Broadcast nested loop join (BNLJ):
   *     Supports both equi-joins and non-equi-joins.
   *     Supports all the join types, but the implementation is optimized for:
   *       1) broadcasting the left side in a right outer join;
   *       2) broadcasting the right side in a left outer, left semi, left anti or existence join;
   *       3) broadcasting either side in an inner-like join.
   *     For other cases, we need to scan the data multiple times, which can be rather slow.
   *
   * - Shuffle-and-replicate nested loop join (a.k.a. cartesian product join):
   *     Supports both equi-joins and non-equi-joins.
   *     Supports only inner like joins.
   */
object JoinSelection extends Strategy with PredicateHelper { ...
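
The constraints listed in the comment above can be encoded as a small eligibility check. This is only a sketch: the object JoinEligibility and all of its types are hypothetical and do not come from Spark. It models which strategies are allowed for a given join, not which one the planner finally picks, because that also depends on hints, sizes, and configuration. "Inner like" is assumed to mean inner and cross joins.

```scala
// Hypothetical model of the constraints in the JoinSelection comment (Spark 3.0).
object JoinEligibility {
  sealed trait JoinKind
  case object Inner extends JoinKind
  case object Cross extends JoinKind
  case object LeftOuter extends JoinKind
  case object RightOuter extends JoinKind
  case object FullOuter extends JoinKind
  case object LeftSemi extends JoinKind
  case object LeftAnti extends JoinKind

  sealed trait Strategy
  case object BroadcastHashJoin extends Strategy
  case object ShuffleHashJoin extends Strategy
  case object SortMergeJoin extends Strategy
  case object BroadcastNestedLoopJoin extends Strategy
  case object CartesianProduct extends Strategy

  // Assumption: "inner like" covers inner and cross joins.
  def isInnerLike(kind: JoinKind): Boolean = kind == Inner || kind == Cross

  /** Strategies the quoted comment allows for this combination of properties. */
  def eligible(isEquiJoin: Boolean, keysSortable: Boolean, kind: JoinKind): List[Strategy] = {
    // Both hash joins: equi-joins only, every join type except full outer.
    val hashOk = isEquiJoin && kind != FullOuter
    val rules: List[(Strategy, Boolean)] = List(
      BroadcastHashJoin       -> hashOk,
      ShuffleHashJoin         -> hashOk,
      // Sort merge join: equi-joins with sortable keys, every join type.
      SortMergeJoin           -> (isEquiJoin && keysSortable),
      // BNLJ: equi- and non-equi-joins, every join type.
      BroadcastNestedLoopJoin -> true,
      // Cartesian product: equi- and non-equi-joins, inner-like joins only.
      CartesianProduct        -> isInnerLike(kind)
    )
    rules.collect { case (strategy, true) => strategy }
  }
}
```

For the case in the question (an equi-join whose keys are not sortable), `JoinEligibility.eligible(isEquiJoin = true, keysSortable = false, JoinEligibility.Inner)` leaves out only the sort merge join. Both hash joins, the broadcast nested loop join, and the cartesian product remain candidates.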