Scala: Removing duplicate columns in Spark using a SQL expression

Tags: scala, apache-spark, apache-spark-sql

I think this question is similar to some existing ones, but it has not actually been asked.

In Spark, how can I run a SQL query so that the duplicate column is removed from the result?

For example, a SQL query run on Spark:

select a.*, b.*
from a
left outer join b
  on a.id = b.id
In this case, how can I remove the duplicate column b.id?


I know we can add extra steps in Spark, such as providing an alias or renaming columns, but is there a faster way to remove the duplicate column just by writing the SQL query?
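For reference, one of those "extra steps" can be done in a single line on the DataFrame side: join on an expression and then drop the right-hand copy of the key. A minimal sketch, assuming Spark 2.x (where spark is the SparkSession) and that a and b are registered as tables or temp views:

// Sketch only: run the join with a predicate, then drop the duplicate key
// column by referencing it through the right-hand DataFrame.
val a = spark.table("a")
val b = spark.table("b")
val joined = a.join(b, a("id") === b("id"), "left_outer").drop(b("id"))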

I have two dataframes, df1 and df2, and will perform the join based on the id column.

scala> val df1  = Seq((1,"mahesh"), (2,"shivangi"),(3,"manoj")).toDF("id", "name")
df1: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> df1.show
+---+--------+
| id|    name|
+---+--------+
|  1|  mahesh|
|  2|shivangi|
|  3|   manoj|
+---+--------+

scala> val df2  = Seq((1,24), (2,23),(3,24)).toDF("id", "age")
df2: org.apache.spark.sql.DataFrame = [id: int, age: int]

scala> df2.show
+---+---+
| id|age|
+---+---+
|  1| 24|
|  2| 23|
|  3| 24|
+---+---+
Here is an incorrect solution, which defines the join column as a predicate:

df1("id") === df2("id")
The wrong result is that the id column appears twice in the joined dataframe:

scala> df1.join(df2, df1("id") === df2("id"), "left").show
+---+--------+---+---+
| id|    name| id|age|
+---+--------+---+---+
|  1|  mahesh|  1| 24|
|  2|shivangi|  2| 23|
|  3|   manoj|  3| 24|
+---+--------+---+---+
The correct solution is to define the join columns as a Seq of strings, Seq("id"), rather than as an expression. The joined dataframe then has no duplicate columns:

scala> df1.join(df2, Seq("id"),"left").show
+---+--------+---+
| id|    name|age|
+---+--------+---+
|  1|  mahesh| 24|
|  2|shivangi| 23|
|  3|   manoj| 24|
+---+--------+---+
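If the goal is to stay purely in SQL, the same de-duplication can be expressed with a USING clause, which, like Seq("id"), keeps a single id column in the result. A sketch, assuming Spark 2.x and that your Spark version supports JOIN ... USING:

// Sketch: register the dataframes as temp views, then join with USING.
// The output should match the Seq("id") result above, with one id column.
df1.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2")
spark.sql("select * from t1 left join t2 using (id)").show()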

For more information, see the Spark join API documentation.


Since Spark 1.4.0, you can use join in two ways: with column names (usingColumns) or with a join expression (joinExprs). With the first form, the join columns appear only once in the output:

/**
 * Inner equi-join with another [[DataFrame]] using the given columns.
 *
 * Different from other join functions, the join columns will only appear once in the output,
 * i.e. similar to SQL's `JOIN USING` syntax.
 *
 * {{{
 *   // Joining df1 and df2 using the columns "user_id" and "user_name"
 *   df1.join(df2, Seq("user_id", "user_name"))
 * }}}
 *
 * Note that if you perform a self-join using this function without aliasing the input
 * [[DataFrame]]s, you will NOT be able to reference any columns after the join, since
 * there is no way to disambiguate which side of the join you would like to reference.
 *
 * @param right Right side of the join operation.
 * @param usingColumns Names of the columns to join on. This columns must exist on both sides.
 * @group dfops
 * @since 1.4.0
 */

def join(right: DataFrame, usingColumns: Seq[String]): DataFrame = {
  join(right, usingColumns, "inner")
}
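As the note in the Scaladoc above says, a self-join with this overload needs the inputs aliased up front if you still want to reference either side afterwards. A minimal sketch, assuming Spark 2.x; the alias names l and r are arbitrary:

import org.apache.spark.sql.functions.col

// Sketch: alias each side before the join so non-key columns can still be
// disambiguated after joining on the shared id column.
val left  = df1.as("l")
val right = df1.as("r")
left.join(right, Seq("id")).select(col("id"), col("l.name"), col("r.name")).show()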
