Scala Spark SQL DataFrame join: what is going on?


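I have two DataFrames. For simplicity, let's call them left and right; I only show example structures.

DataFrame "left": (this DataFrame is quite large)

Joining left to right on src = loc with a left outer join returns a DataFrame with the expected join.

What I actually want is to match on both col1 and col2 (src and dst in this example), which should return the following: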

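src | dst | src_loc | src_name | dst_loc | dst_name
---------------------------------------------------
b   | a   | b       | Paris    | a       | London
c   | b   | null    | null     | b       | Paris
a   | c   | a       | London   | null    | null

Out of frustration, I tried creating a new DataFrame from a second, identical Hive query instead of reusing the right DataFrame.

That works, but it seems very wrong to me (there should be no need to call Hive twice for the same data).

The next problem I run into is that I want to filter on the added columns. For the sake of argument, suppose I want all rows where the src location name is Paris: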

dfjoin1.filter($"name" === "Paris")

This fails because the column name is ambiguous. How do I get around this? Is there an easy way to prefix the column names as part of the join?

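Not sure, but I think the first failure is also caused by column ambiguity: when you compare dfjoin1("dst") === right("loc"), you may actually be comparing dst with the loc column that was brought in by the previous join.

In other words, I believe both problems can be solved by more precise column naming, which ensures there is no ambiguity. A simple way to achieve this (and to get the desired output schema) is to rename the columns after each join: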

val result = left
  .join(right, $"src" === $"loc", "left_outer")
  .withColumnRenamed("loc", "src_loc")
  .withColumnRenamed("name", "src_name")
  .join(right, $"dst" === $"loc", "left_outer") // "loc" is now non-ambiguous, because we renamed left's "loc"
  .withColumnRenamed("loc", "dst_loc")
  .withColumnRenamed("name", "dst_name")

result.show()
// +---+---+-------+--------+-------+--------+
// |src|dst|src_loc|src_name|dst_loc|dst_name|
// +---+---+-------+--------+-------+--------+
// |  b|  a|      b|   Paris|      a|  London|
// |  c|  b|   null|    null|      b|   Paris|
// |  a|  c|      a|  London|   null|    null|
// +---+---+-------+--------+-------+--------+

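An alternative is to use DataFrame.as(String) to give the right DataFrame an alias before each use, with a different alias each time. The result is slightly different, but still usable: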

left
  .join(right.as("src"), $"src" === $"src.loc", "left_outer")
  .join(right.as("dst"), $"dst" === $"dst.loc", "left_outer")
  .show()

// +---+---+----+------+----+------+
// |src|dst| loc|  name| loc|  name|
// +---+---+----+------+----+------+
// |  b|  a|   b| Paris|   a|London|
// |  c|  b|null|  null|   b| Paris|
// |  a|  c|   a|London|null|  null|
// +---+---+----+------+----+------+

The schema shows two columns named loc and two named name, but each can be referenced with the relevant prefix, for example src.name or dst.loc.
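The alias-qualified names can also be used in a filter, which is one way around the ambiguity error from the question. The following is only a minimal sketch under several assumptions: Spark 2.x or later with a local SparkSession (the original post uses hiveContext), sample data reconstructed from the output tables above, and hypothetical aliases s and d chosen so that they do not coincide with the src and dst column names.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local SparkSession stands in for the poster's hiveContext.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("alias-filter-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data, reconstructed from the output tables in this thread.
val left  = Seq(("b", "a"), ("c", "b"), ("a", "c")).toDF("src", "dst")
val right = Seq(("a", "London"), ("b", "Paris")).toDF("loc", "name")

// Join the same DataFrame twice under two different aliases ("s" and "d" are illustrative names).
val joined = left
  .join(right.as("s"), $"src" === $"s.loc", "left_outer")
  .join(right.as("d"), $"dst" === $"d.loc", "left_outer")

// The qualified name selects the "name" column that came from the "s" alias.
val fromParis = joined.filter($"s.name" === "Paris")

// Optionally flatten to unambiguous column names for downstream use.
val flat = fromParis.select(
  $"src", $"dst",
  $"s.name".as("src_name"),
  $"d.name".as("dst_name")
)
flat.show()
```

Selecting with explicit aliases at the end is a design choice rather than a requirement; it simply avoids carrying duplicate column names into later steps.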

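Further to Tzach Zohar's answer: as noted in the comments, renaming becomes very ugly when there are many columns. To work around this, you can take the column names from the table schema and prefix all of them, like so: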

var tmp = left.join(right,$"src" === $"loc", "left_outer")

right.schema.fields.foreach { x => tmp = tmp.withColumnRenamed(x.name, "src_" + x.name) }

tmp = tmp.join(right,$"dst" === $"loc", "left_outer")

right.schema.fields.foreach { x => tmp = tmp.withColumnRenamed(x.name, "dst_" + x.name) }

// +---+---+-------+--------+-------+--------+
// |src|dst|src_loc|src_name|dst_loc|dst_name|
// +---+---+-------+--------+-------+--------+
// |  b|  a|      b|   Paris|      a|  London|
// |  c|  b|   null|    null|      b|   Paris|
// |  a|  c|      a|  London|   null|    null|
// +---+---+-------+--------+-------+--------+
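A variation on the same idea is to prefix the columns of right before joining rather than renaming them afterwards, which avoids the mutable var. This is a sketch, not the answer author's code: withPrefix is a hypothetical helper, the SparkSession setup assumes Spark 2.x or later, and the sample data is reconstructed from the output tables above. It also assumes column names without dots, since col(c) would interpret a dot as a qualifier.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper: returns df with every column renamed to prefix + original name.
def withPrefix(df: DataFrame, prefix: String): DataFrame =
  df.select(df.columns.map(c => col(c).as(prefix + c)): _*)

// Assumption: a local SparkSession stands in for the poster's hiveContext.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("prefix-join-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data, reconstructed from the output tables in this thread.
val left  = Seq(("b", "a"), ("c", "b"), ("a", "c")).toDF("src", "dst")
val right = Seq(("a", "London"), ("b", "Paris")).toDF("loc", "name")

// Two differently prefixed views of the same lookup table.
val srcSide = withPrefix(right, "src_")
val dstSide = withPrefix(right, "dst_")

val result = left
  .join(srcSide, $"src" === $"src_loc", "left_outer")
  .join(dstSide, $"dst" === $"dst_loc", "left_outer")

// Every column name is now unique, so the filter from the question is unambiguous.
val fromParis = result.filter($"src_name" === "Paris")
fromParis.show()
```

Because the renaming touches only right, a column in left that happens to share a name with one in right is left alone, whereas calling withColumnRenamed on the joined DataFrame would rename every column with that name.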

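Is there any way to prefix all the columns in a join, or do I need a withColumnRenamed call for each column? I have about 30 columns in the right table, which means about 60 rename statements. That seems a bit excessive, but I guess it may be necessary.

I wonder whether this actually solves the issue of "the whole Spark job failing with nothing wrong, but perhaps taking too long, or what is going on". If not, please comment!

OK, I will be able to test the full dataset on Monday. The left table has about 1 billion rows, sometimes more. The right table has at most a few thousand records.

I have added an answer below with a slight variation on the withColumnRenamed approach. I have tested it on a local dataset and it seems to work; I have not yet had a chance to test it on the large dataset.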
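Given the sizes mentioned in the comments (a very large left table and a right table of a few thousand rows), one option that may be worth trying is a broadcast hint on the small side. Whether it helps with the poster's job is not established anywhere in this thread; the sketch below only shows the call, using the broadcast function from org.apache.spark.sql.functions, a local SparkSession (an assumption, since the post uses hiveContext), and hypothetical sample data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Assumption: a local SparkSession stands in for the poster's hiveContext.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("broadcast-join-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data; in the thread, left is ~1 billion rows and right a few thousand.
val left  = Seq(("b", "a"), ("c", "b"), ("a", "c")).toDF("src", "dst")
val right = Seq(("a", "London"), ("b", "Paris")).toDF("loc", "name")

// broadcast(...) marks the small lookup table as a candidate for a broadcast join,
// so the large side does not need to be shuffled for the join.
val joined = left
  .join(broadcast(right).as("s"), $"src" === $"s.loc", "left_outer")
  .join(broadcast(right).as("d"), $"dst" === $"d.loc", "left_outer")

joined.explain() // inspect the physical plan to see which join strategy was chosen
```

The hint is a request rather than a guarantee, so checking the plan printed by explain() is the way to confirm which strategy Spark actually picked.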

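// Code from the question: the workaround that runs the same Hive query twice
// (the query text is elided in the original post)
val right = hiveContext.sql(FROM .....)
val right2 = hiveContext.sql(FROM .....)

val dfjoin1 = left.join(right, left("src") === right("loc"), "left_outer")
dfjoin1.join(right2, dfjoin1("dst") === right2("loc"), "left_outer")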
