Spark Scala: merging multiple DataFrames

I have three files (which I read in as DataFrames):
## +---+----+----+---+
## |pk1|pk2|val1|val2|
## +---+----+----+---+
## | 1| aa| ab| ac|
## | 2| bb| bc| bd|
## +---+----+----+---+
## +---+----+----+---+
## |pk1|pk2|val1|val2|
## +---+----+----+---+
## | 1| aa| ab| ad|
## | 2| bb| bb| bd|
## +---+----+----+---+
## +---+----+----+---+
## |pk1|pk2|val1|val2|
## +---+----+----+---+
## | 1| aa| ac| ad|
## | 2| bb| bc| bd|
## +---+----+----+---+
I need to compare the first two files (read as DataFrames), identify only the changes, and then merge them with the third file, so my output should be:
## +---+----+----+---+
## |pk1|pk2|val1|val2|
## +---+----+----+---+
## | 1| aa| ac| ad|
## | 2| bb| bb| bd|
## +---+----+----+---+
How do I pick up only the changed columns and update the other DataFrame?

I can't comment yet, so I'll take a shot at this; it may still need revision. As I understand it, you're looking for the last unique change. Val1 goes {ab -> ab -> ac, bc -> bb -> bc}, so the final result is {ac, bb}: the last file's bc already appears in the first file and is therefore not unique. If that's the case, the best way to handle it is to build a set and take the last value from it. I'll use a UDF for this. From your example:
val df1: DataFrame = sparkContext.parallelize(Seq((1,"aa","ab","ac"),(2,"bb","bc","bd"))).toDF("pk1","pk2","val1","val2")
val df2: DataFrame = sparkContext.parallelize(Seq((1,"aa","ab","ad"),(2,"bb","bb","bd"))).toDF("pk1","pk2","val1","val2")
val df3: DataFrame = sparkContext.parallelize(Seq((1,"aa","ac","ad"),(2,"bb","bc","bd"))).toDF("pk1","pk2","val1","val2")
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.UserDefinedFunction
import sqlContext.implicits._
def getChange: UserDefinedFunction =
udf((a: String, b: String, c: String) => Set(a,b,c).last)
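A note on why `.last` works here (a side observation, not part of the original answer): Scala's small immutable sets of up to four elements are specialized `Set1`..`Set4` instances that iterate in insertion order, so `Set(a,b,c).last` yields the most recently seen distinct value:

```scala
// Small immutable sets (Set1..Set4) iterate in insertion order,
// so .last is the latest distinct value across the three files.
Set("ab", "ab", "ac").last  // "ac" -- file3 introduced a new value
Set("bc", "bb", "bc").last  // "bb" -- file3's "bc" duplicates file1's, so file2's change wins
```

Note that this is an implementation detail of the small-set classes: beyond four distinct elements, iteration order is no longer insertion order, so the trick only holds when comparing a handful of files.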
df1
.join(df2,df1("pk1")===df2("pk1") && df1("pk2")===df2("pk2"), "inner")
.join(df3,df1("pk1")===df3("pk1") && df1("pk2")===df3("pk2"), "inner")
.select(df1("pk1"),df1("pk2"),
df1("val1").as("df1Val1"),df2("val1").as("df2Val1"),df3("val1").as("df3Val1"),
df1("val2").as("df1Val2"),df2("val2").as("df2Val2"),df3("val2").as("df3Val2"))
.withColumn("val1",getChange($"df1Val1",$"df2Val1",$"df3Val1"))
.withColumn("val2",getChange($"df1Val2",$"df2Val2",$"df3Val2"))
.select($"pk1",$"pk2",$"val1",$"val2")
.orderBy($"pk1")
.show(false)
This produces:
+---+---+----+----+
|pk1|pk2|val1|val2|
+---+---+----+----+
|1 |aa |ac |ad |
|2 |bb |bb |bd |
+---+---+----+----+
+---+---+---+---+----+----+----+
|pk1|pk2|pk3|pk4|val1|val2|val3|
+---+---+---+---+----+----+----+
|1 |aa |c |d |ac |ad |ae |
|2 |bb |d |e |bb |bd |bg |
+---+---+---+---+----+----+----+
Obviously this gets a bit cumbersome to write out if you have more columns or more DataFrames, but for your example it should do the trick.
Edit: this is for adding more columns into the mix. As I said, the above is a bit cumbersome. This iterates over each column until none are left:
require(df1.columns.sameElements(df2.columns) && df1.columns.sameElements(df3.columns),"DF Columns do not match") //this is a check so may not be needed
val cols: Array[String] = df1.columns
def getChange: UserDefinedFunction = udf((a: String, b: String, c: String) => Set(a,b,c).last)
def createFrame(cols: Array[String], df1: DataFrame, df2: DataFrame, df3:DataFrame): scala.collection.mutable.ListBuffer[DataFrame] = {
val list: scala.collection.mutable.ListBuffer[DataFrame] = new scala.collection.mutable.ListBuffer[DataFrame]()
val keys = cols.slice(0,2) //get the keys
val columns = cols.slice(2, cols.length).toSeq //get the columns to use
def helper(columns: Seq[String]): scala.collection.mutable.ListBuffer[DataFrame] = {
if(columns.isEmpty) list
else {
list += df1
.join(df2, df1.col(keys(0)) === df2.col(keys(0)) && df1.col(keys(1)) === df2.col(keys(1)), "inner")
.join(df3, df1.col(keys(0)) === df3.col(keys(0)) && df1.col(keys(1)) === df3.col(keys(1)), "inner")
.select(df1.col(keys(0)), df1.col(keys(1)),
getChange(df1.col(columns.head), df2.col(columns.head), df3.col(columns.head)).as(columns.head))
helper(columns.tail) //use tail recursion
}
}
helper(columns)
}
val list: scala.collection.mutable.ListBuffer[DataFrame] = createFrame(cols, df1, df2, df3)
list.reduce((a,b) =>
a
.join(b,a(cols.head)===b(cols.head) && a(cols(1))===b(cols(1)),"inner")
.drop(b(cols.head))
.drop(b(cols(1))))
.select(cols.head, cols.tail: _*)
.orderBy(cols.head)
.show
An example with 3 value columns, which is then passed through the code above:
val df1: DataFrame = sparkContext.parallelize(Seq((1,"aa","ab","ac","ad"),(2,"bb","bc","bd","bc"))).toDF("pk1","pk2","val1","val2","val3")
val df2: DataFrame = sparkContext.parallelize(Seq((1,"aa","ab","ad","ae"),(2,"bb","bb","bd","bf"))).toDF("pk1","pk2","val1","val2","val3")
val df3: DataFrame = sparkContext.parallelize(Seq((1,"aa","ac","ad","ae"),(2,"bb","bc","bd","bg"))).toDF("pk1","pk2","val1","val2","val3")
Running the code above on these gives:
//output
+---+---+----+----+----+
|pk1|pk2|val1|val2|val3|
+---+---+----+----+----+
| 1| aa| ac| ad| ae|
| 2| bb| bb| bd| bg|
+---+---+----+----+----+
There may well be a more efficient way to do this, but this is what came to mind off the top of my head.
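One such alternative (my own sketch, not from the answer above) is to join once on the keys and derive every value column in a single `select` with `when`/`otherwise`, instead of building one joined frame per column and reducing. The rule encoded below is an assumption matching the example: if file1 and file2 agree, take file3's value; otherwise keep file2's change.

```scala
import org.apache.spark.sql.functions.{col, when}

val keys = Seq("pk1", "pk2")
val valueCols = df1.columns.filterNot(keys.contains)

// One three-way join on the keys; the Seq-based join deduplicates key columns
val joined = df1.alias("a")
  .join(df2.alias("b"), keys, "inner")
  .join(df3.alias("c"), keys, "inner")

// Per value column: unchanged between file1 and file2 -> take file3's value;
// changed -> keep file2's change
val outCols = keys.map(col) ++ valueCols.map { c =>
  when(col(s"a.$c") === col(s"b.$c"), col(s"c.$c"))
    .otherwise(col(s"b.$c"))
    .as(c)
}

joined.select(outCols: _*).orderBy("pk1").show(false)
```

This avoids the per-column join-and-reduce, so it should plan and run noticeably better for wide tables.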
Edit2
To do this with an arbitrary number of keys you can do the following. The number of keys needs to be defined up front, and this could also be cleaned up further. I ran it with 4 keys, but you should run some tests of your own; it should work:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.UserDefinedFunction
val df1: DataFrame = sparkContext.parallelize(Seq((1,"aa","c","d","ab","ac","ad"),(2,"bb","d","e","bc","bd","bc"))).toDF("pk1","pk2","pk3","pk4","val1","val2","val3")
val df2: DataFrame = sparkContext.parallelize(Seq((1,"aa","c","d","ab","ad","ae"),(2,"bb","d","e","bb","bd","bf"))).toDF("pk1","pk2","pk3","pk4","val1","val2","val3")
val df3: DataFrame = sparkContext.parallelize(Seq((1,"aa","c","d","ac","ad","ae"),(2,"bb","d","e","bc","bd","bg"))).toDF("pk1","pk2","pk3","pk4","val1","val2","val3")
require(df1.columns.sameElements(df2.columns) && df1.columns.sameElements(df3.columns),"DF Columns do not match")
val cols: Array[String] = df1.columns
def getChange: UserDefinedFunction = udf((a: String, b: String, c: String) => Set(a,b,c).last)
def createFrame(cols: Array[String], df1: DataFrame, df2: DataFrame, df3:DataFrame): scala.collection.mutable.ListBuffer[DataFrame] = {
val list: scala.collection.mutable.ListBuffer[DataFrame] = new scala.collection.mutable.ListBuffer[DataFrame]()
val keys = cols.slice(0,4)//get the keys
val columns = cols.slice(4, cols.length).toSeq //get the columns to use
def helper(columns: Seq[String]): scala.collection.mutable.ListBuffer[DataFrame] = {
if(columns.isEmpty) list
else {
list += df1
.join(df2, Seq(keys :_*), "inner")
.join(df3, Seq(keys :_*), "inner")
.withColumn(columns.head + "Out", getChange(df1.col(columns.head), df2.col(columns.head), df3.col(columns.head)))
.select(col(columns.head + "Out").as(columns.head) +: keys.map(x => df1.col(x)) : _*)
helper(columns.tail)
}
}
helper(columns)
}
val list: scala.collection.mutable.ListBuffer[DataFrame] = createFrame(cols, df1, df2, df3)
list.foreach(a => a.show(false))
val keys=cols.slice(0,4)
list.reduce((a,b) =>
a.alias("a").join(b.alias("b"),Seq(keys :_*),"inner")
.select("a.*","b." + b.columns.head))
.orderBy(cols.head)
.show(false)
This produces:
+---+---+---+---+----+----+----+
|pk1|pk2|pk3|pk4|val1|val2|val3|
+---+---+---+---+----+----+----+
|1 |aa |c |d |ac |ad |ae |
|2 |bb |d |e |bb |bd |bg |
+---+---+---+---+----+----+----+
I was also able to do this by registering the DataFrames as temp views and then doing a select with a case statement, like this:
df1.createTempView("df1")
df2.createTempView("df2")
df3.createTempView("df3")
select case when df1.val1=df2.val1 and df1.val1<>df3.val1 then df3.val1 end
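That fragment can be completed into a full query. Here is a hedged sketch (assuming a `SparkSession` named `spark`, the two-key example above, and that the intended fallback is file2's value whenever file1 and file2 disagree):

```scala
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
df3.createOrReplaceTempView("df3")

// If file1 and file2 agree but file3 differs, take file3's new value;
// otherwise keep file2's (possibly changed) value.
val result = spark.sql("""
  SELECT df1.pk1, df1.pk2,
         CASE WHEN df1.val1 = df2.val1 AND df1.val1 <> df3.val1
              THEN df3.val1 ELSE df2.val1 END AS val1,
         CASE WHEN df1.val2 = df2.val2 AND df1.val2 <> df3.val2
              THEN df3.val2 ELSE df2.val2 END AS val2
  FROM df1
  JOIN df2 ON df1.pk1 = df2.pk1 AND df1.pk2 = df2.pk2
  JOIN df3 ON df1.pk1 = df3.pk1 AND df1.pk2 = df3.pk2
""")
result.orderBy("pk1").show(false)
```

On the example data this reproduces the desired output (val1 = ac/bb, val2 = ad/bd).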
This is much faster.

I think you need to be a bit more specific (there's some ambiguity), but have you tried a join? You can join on any condition (even !=).

May I know what the ambiguity is? I can join on the pks, but wouldn't that just return everything? You mean df1 join df2 on df1.pk1 = df2.pk1 and df1.pk2 = df2.pk2? That's how I should join, which is fine, but how do I get only the modified columns? For example, when I join the first 2, I should get only pk1 -> 1, pk2 -> aa, val2 -> ad and pk1 -> 2, pk2 -> bb, val1 -> bb.

In column val1, the first DataFrame has bc in the second row, the second DataFrame has bb in the same column and row, and the third DataFrame has bc. So why does your final DataFrame have bb? Shouldn't it be bc?

Please read all 3 DataFrames as 3 different files. I want to compare the first 2 DataFrames (files), determine whether anything changed, and update only those changes into the 3rd DataFrame. So when I compare the first 2, I get val1 as bb (which is a change), and that change has to be carried into the last DataFrame, hence my final result should be bb.