Spark Scala: merging multiple DataFrames

Tags: scala, apache-spark, dataframe, merge

I have received three files:

## +---+----+----+---+
## |pk1|pk2|val1|val2|
## +---+----+----+---+
## |  1| aa|  ab|  ac|
## |  2| bb|  bc|  bd|
## +---+----+----+---+

## +---+----+----+---+
## |pk1|pk2|val1|val2|
## +---+----+----+---+
## |  1| aa|  ab|  ad|
## |  2| bb|  bb|  bd|
## +---+----+----+---+

## +---+----+----+---+
## |pk1|pk2|val1|val2|
## +---+----+----+---+
## |  1| aa|  ac|  ad|
## |  2| bb|  bc|  bd|
## +---+----+----+---+
I need to compare the first two files (which I read in as DataFrames), identify only the changes, and then merge them with the third file, so my output should be:

## +---+----+----+---+
## |pk1|pk2|val1|val2|
## +---+----+----+---+
## |  1| aa|  ac|  ad|
## |  2| bb|  bb|  bd|
## +---+----+----+---+

How can I pick up only the changed columns and update the other DataFrame?

I can't comment yet, so I'll take a stab at this; it may still need revising. As I understand it, you are looking for the last unique change. So val1 has {ab -> ab -> ac, bc -> bb -> bc}, and the final result is {ac, bb}, because the last file's bc already appears in the first file and is therefore not unique. If that is the case, the best way to handle it is to build a Set and take the last value from the Set. I'll do that with a UDF.
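As a quick illustration of why the Set trick works (my addition, not part of the original answer): Scala's small immutable Sets (up to four elements) iterate in insertion order and drop duplicates, so .last returns the most recent distinct value:

// Scala's Set1..Set4 keep insertion order, so .last is the last *distinct* value seen
Set("ab", "ab", "ac").last  // "ac" -- the third file introduced a new value
Set("bc", "bb", "bc").last  // "bb" -- "bc" repeats the first file's value, so "bb" is the last change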

From your example:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.UserDefinedFunction
import sqlContext.implicits._ // needed for toDF and the $"" column syntax

val df1: DataFrame = sparkContext.parallelize(Seq((1,"aa","ab","ac"),(2,"bb","bc","bd"))).toDF("pk1","pk2","val1","val2")
val df2: DataFrame = sparkContext.parallelize(Seq((1,"aa","ab","ad"),(2,"bb","bb","bd"))).toDF("pk1","pk2","val1","val2")
val df3: DataFrame = sparkContext.parallelize(Seq((1,"aa","ac","ad"),(2,"bb","bc","bd"))).toDF("pk1","pk2","val1","val2")

// UDF that returns the last distinct value seen across the three files
def getChange: UserDefinedFunction =
  udf((a: String, b: String, c: String) => Set(a, b, c).last)

df1
  .join(df2, df1("pk1") === df2("pk1") && df1("pk2") === df2("pk2"), "inner")
  .join(df3, df1("pk1") === df3("pk1") && df1("pk2") === df3("pk2"), "inner")
  .select(df1("pk1"), df1("pk2"),
    df1("val1").as("df1Val1"), df2("val1").as("df2Val1"), df3("val1").as("df3Val1"),
    df1("val2").as("df1Val2"), df2("val2").as("df2Val2"), df3("val2").as("df3Val2"))
  .withColumn("val1", getChange($"df1Val1", $"df2Val1", $"df3Val1"))
  .withColumn("val2", getChange($"df1Val2", $"df2Val2", $"df3Val2"))
  .select($"pk1", $"pk2", $"val1", $"val2")
  .orderBy($"pk1")
  .show(false)
This will yield:

+---+---+----+----+
|pk1|pk2|val1|val2|
+---+---+----+----+
|1  |aa |ac  |ad  |
|2  |bb |bb  |bd  |
+---+---+----+----+
Obviously, with more columns or more DataFrames this gets a bit cumbersome to write out, but for your example this should do it.

Edit:
This handles adding more columns into the mix. As I said, the approach above gets a bit cumbersome. This one iterates over each value column until none are left:

require(df1.columns.sameElements(df2.columns) && df1.columns.sameElements(df3.columns),"DF Columns do not match") //this is a check so may not be needed

val cols: Array[String] = df1.columns

def getChange: UserDefinedFunction = udf((a: String, b: String, c: String) => Set(a,b,c).last)

def createFrame(cols: Array[String], df1: DataFrame, df2: DataFrame, df3: DataFrame): scala.collection.mutable.ListBuffer[DataFrame] = {

  val list: scala.collection.mutable.ListBuffer[DataFrame] = new scala.collection.mutable.ListBuffer[DataFrame]()
  val keys = cols.slice(0, 2)                    // get the key columns
  val columns = cols.slice(2, cols.length).toSeq // get the value columns to merge

  // builds one small DataFrame (keys + one merged value column) per value column
  def helper(columns: Seq[String]): scala.collection.mutable.ListBuffer[DataFrame] = {
    if (columns.isEmpty) list
    else {
      list += df1
        .join(df2, df1.col(keys(0)) === df2.col(keys(0)) && df1.col(keys(1)) === df2.col(keys(1)), "inner")
        .join(df3, df1.col(keys(0)) === df3.col(keys(0)) && df1.col(keys(1)) === df3.col(keys(1)), "inner")
        .select(df1.col(keys(0)), df1.col(keys(1)),
          getChange(df1.col(columns.head), df2.col(columns.head), df3.col(columns.head)).as(columns.head))

      helper(columns.tail) // tail recursion over the remaining columns
    }
  }

  helper(columns)
}

val list: scala.collection.mutable.ListBuffer[DataFrame] = createFrame(cols, df1, df2, df3)

// join the per-column frames back together on the key columns
list.reduce((a, b) =>
    a.join(b, a(cols.head) === b(cols.head) && a(cols(1)) === b(cols(1)), "inner")
      .drop(b(cols.head))
      .drop(b(cols(1))))
  .select(cols.head, cols.tail: _*)
  .orderBy(cols.head)
  .show
An example with 3 value columns, which you then pass into the code above:

val df1: DataFrame = sparkContext.parallelize(Seq((1,"aa","ab","ac","ad"),(2,"bb","bc","bd","bc"))).toDF("pk1","pk2","val1","val2","val3")
val df2: DataFrame = sparkContext.parallelize(Seq((1,"aa","ab","ad","ae"),(2,"bb","bb","bd","bf"))).toDF("pk1","pk2","val1","val2","val3")
val df3: DataFrame = sparkContext.parallelize(Seq((1,"aa","ac","ad","ae"),(2,"bb","bc","bd","bg"))).toDF("pk1","pk2","val1","val2","val3")
Running the code above against these DataFrames gives:

//output
+---+---+----+----+----+
|pk1|pk2|val1|val2|val3|
+---+---+----+----+----+
|  1| aa|  ac|  ad|  ae|
|  2| bb|  bb|  bd|  bg|
+---+---+----+----+----+
There may well be a more efficient way to do this, but this is what came to mind off the top of my head.
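One possible alternative along those lines (a sketch of my own, not part of the original answer): skip the UDF and build each output column with the built-in when/otherwise expressions. It assumes the rule described in the question, i.e. if a value changed between the first two files keep the second file's value, otherwise take the third file's value:

import org.apache.spark.sql.functions.{col, when}

// assumes the df1/df2/df3 from the example above (two key columns, any number of value columns)
val keyCols   = Seq("pk1", "pk2")
val valueCols = df1.columns.filterNot(keyCols.contains)

val joined = df1.join(df2, keyCols, "inner").join(df3, keyCols, "inner")

val merged = valueCols.foldLeft(joined) { (df, c) =>
  // no change between file 1 and file 2 -> take file 3's value, otherwise keep file 2's change
  df.withColumn(c + "_out", when(df1(c) === df2(c), df3(c)).otherwise(df2(c)))
}

merged
  .select((keyCols.map(col) ++ valueCols.map(c => col(c + "_out").as(c))): _*)
  .orderBy("pk1")
  .show(false)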

Edit2

To do this with an arbitrary number of keys, you can do the following. The number of key columns needs to be defined up front, and this could probably be cleaned up further. I got it working with 4/5 keys, and you should run some tests of your own, but it should work:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.UserDefinedFunction

val df1: DataFrame = sparkContext.parallelize(Seq((1,"aa","c","d","ab","ac","ad"),(2,"bb","d","e","bc","bd","bc"))).toDF("pk1","pk2","pk3","pk4","val1","val2","val3")
val df2: DataFrame = sparkContext.parallelize(Seq((1,"aa","c","d","ab","ad","ae"),(2,"bb","d","e","bb","bd","bf"))).toDF("pk1","pk2","pk3","pk4","val1","val2","val3")
val df3: DataFrame = sparkContext.parallelize(Seq((1,"aa","c","d","ac","ad","ae"),(2,"bb","d","e","bc","bd","bg"))).toDF("pk1","pk2","pk3","pk4","val1","val2","val3")

require(df1.columns.sameElements(df2.columns) && df1.columns.sameElements(df3.columns),"DF Columns do not match")

val cols: Array[String] = df1.columns

def getChange: UserDefinedFunction = udf((a: String, b: String, c: String) => Set(a,b,c).last)

def createFrame(cols: Array[String], df1: DataFrame, df2: DataFrame, df3: DataFrame): scala.collection.mutable.ListBuffer[DataFrame] = {

  val list: scala.collection.mutable.ListBuffer[DataFrame] = new scala.collection.mutable.ListBuffer[DataFrame]()
  val keys = cols.slice(0, 4)                    // get the keys
  val columns = cols.slice(4, cols.length).toSeq // get the columns to use

  def helper(columns: Seq[String]): scala.collection.mutable.ListBuffer[DataFrame] = {
    if (columns.isEmpty) list
    else {
      list += df1
        .join(df2, Seq(keys: _*), "inner")
        .join(df3, Seq(keys: _*), "inner")
        .withColumn(columns.head + "Out", getChange(df1.col(columns.head), df2.col(columns.head), df3.col(columns.head)))
        .select(col(columns.head + "Out").as(columns.head) +: keys.map(x => df1.col(x)): _*)

      helper(columns.tail)
    }
  }

  helper(columns)
}

val list: scala.collection.mutable.ListBuffer[DataFrame] = createFrame(cols, df1, df2, df3)
list.foreach(a => a.show(false))
val keys = cols.slice(0, 4)

list.reduce((a, b) =>
    a.alias("a").join(b.alias("b"), Seq(keys: _*), "inner")
      .select("a.*", "b." + b.columns.head))
  .orderBy(cols.head)
  .show(false)
This will yield:

+---+---+---+---+----+----+----+
|pk1|pk2|pk3|pk4|val1|val2|val3|
+---+---+---+---+----+----+----+
|1  |aa |c  |d  |ac  |ad  |ae  |
|2  |bb |d  |e  |bb  |bd  |bg  |
+---+---+---+---+----+----+----+


I was also able to do this by registering the DataFrames as temp views and then running a select with a CASE statement, like this:

df1.createTempView("df1")
df2.createTempView("df2")
df3.createTempView("df3")

select case when df1.val1=df2.val1 and df1.val1<>df3.val1 then df3.val1 end

This is much faster.
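For completeness, a sketch of how that fragment could be expanded into a full query (my addition, not the poster's exact SQL; the CASE rule assumed here is "if a value changed between df1 and df2 keep df2's value, otherwise take df3's", and it assumes a SparkSession named spark):

// Sketch only -- assumes the temp views df1/df2/df3 registered above and a SparkSession `spark`
val result = spark.sql("""
  SELECT df1.pk1, df1.pk2,
         CASE WHEN df1.val1 <> df2.val1 THEN df2.val1 ELSE df3.val1 END AS val1,
         CASE WHEN df1.val2 <> df2.val2 THEN df2.val2 ELSE df3.val2 END AS val2
  FROM df1
  JOIN df2 ON df1.pk1 = df2.pk1 AND df1.pk2 = df2.pk2
  JOIN df3 ON df1.pk1 = df3.pk1 AND df1.pk2 = df3.pk2
""")

result.orderBy("pk1").show(false)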


I think you need to be a bit more specific (there is some ambiguity), but have you tried a join? You can join on any condition (even !=).

May I know what the ambiguity is? I can join on the pks, but wouldn't that just return everything? You mean df1 join df2 on df1.pk1 = df2.pk1 and df1.pk2 = df2.pk2? If that is how I should join, fine, but how do I get only the modified columns? For example, when I join the first 2, I should only get the columns pk1 -> 1, pk2 -> aa, val2 -> ad and pk1 -> 2, pk2 -> bb, val1 -> bb.

For val1, the first DataFrame has bc in the second row, then the second DataFrame has bb in the same column and row, and the third DataFrame has bc. So why does your final DataFrame have bb? Shouldn't it be bc?

Please treat all 3 DataFrames as being read from 3 different files. I want to compare the first 2 DataFrames (files), determine whether anything changed, and update only those changes in the 3rd DataFrame. So when I compare the first 2, I get bb for val1 (which is a change), and that change has to be applied to the last DataFrame, so my final result should be bb.
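As a footnote to the comment thread above, a minimal sketch (my addition, not from the question or answers) of how joining the first two files could surface only the cells that actually changed, per value column:

// Sketch (not from the original post): join file 1 and file 2 on the keys and, for each
// value column, keep only the rows where that column differs between the two files.
// Assumes the df1/df2 DataFrames and column names from the example above (Spark 2.x syntax).
import org.apache.spark.sql.functions.col

val joined12 = df1.alias("a").join(df2.alias("b"),
  col("a.pk1") === col("b.pk1") && col("a.pk2") === col("b.pk2"), "inner")

Seq("val1", "val2").foreach { c =>
  joined12
    .filter(col("a." + c) =!= col("b." + c))                 // only rows where this column changed
    .select(col("a.pk1"), col("a.pk2"), col("b." + c).as(c)) // the key plus the new value from file 2
    .show(false)
}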