Merging two dataframes in Scala Spark
I have two dataframes:

dataframe1:
+-----++-----++-------------+
| id || name| has_bank_acc |
+-----++-----++-------------+
| 0|| qwe|| true |
| 1|| asd|| false |
| 2|| rty|| false |
| 3|| tyu|| true |
+-----++-----++-------------+
dataframe2:
+-----++-----++--------------+
| id || name| has_email_acc |
+-----++-----++--------------+
| 0|| qwe|| true |
| 5|| hjk|| false |
| 8|| oiu|| false |
| 7|| nmb|| true |
+-----++-----++--------------+
I have to merge these dataframes to get the following:
+-----++-----++-------------++---------------+
| id || name| has_bank_acc || has_email_acc |
+-----++-----++-------------++---------------+
|   0||  qwe||         true||           null|
|   1||  asd||        false||           null|
|   2||  rty||        false||           null|
|   3||  tyu||         true||           null|
|   0||  qwe||         null||           true|
|   5||  hjk||         null||          false|
|   8||  oiu||         null||          false|
|   7||  nmb||         null||           true|
+-----++-----++-------------++---------------+
I have tried union and join, but without success.

You cannot perform a union on dataframes with different columns, and if you add the missing columns and pass null you get a data-type error. So the only option left is a join:
scala> df1.show()
+---+----+------------+
| id|name|has_bank_acc|
+---+----+------------+
| 0| qwe| true|
| 1| asd| false|
| 2| rty| false|
| 3| tyu| true|
+---+----+------------+
scala> df2.show()
+---+----+-------------+
| id|name|has_email_acc|
+---+----+-------------+
|  0| qwe|         true|
|  5| hjk|        false|
|  8| oiu|        false|
|  7| nmb|         true|
+---+----+-------------+
scala> val df11 = df1.withColumn("fid", lit(1))
scala> val df22 = df2.withColumn("fid", lit(2))
scala> df11.alias("1").join(df22.alias("2"), List("fid", "id", "name"), "full").drop("fid").show()
+---+----+------------+-------------+
| id|name|has_bank_acc|has_email_acc|
+---+----+------------+-------------+
|  0| qwe|        true|         null|
|  1| asd|       false|         null|
|  2| rty|       false|         null|
|  3| tyu|        true|         null|
|  0| qwe|        null|         true|
|  5| hjk|        null|        false|
|  8| oiu|        null|        false|
|  7| nmb|        null|         true|
+---+----+------------+-------------+
A solution could be:
scala> df1.show
+---+----+------------+
| id|name|has_bank_acc|
+---+----+------------+
| 0| qwe| true|
| 1| asd| false|
| 2| rty| false|
| 3| tyu| true|
+---+----+------------+
scala> df2.show
+---+----+-------------+
| id|name|has_email_acc|
+---+----+-------------+
| 0| qwe| true|
| 5| hjk| false|
| 8| oiu| false|
| 7| nmb| true|
+---+----+-------------+
scala> val cols1 = df1.columns.toSet
cols1: scala.collection.immutable.Set[String] = Set(id, name, has_bank_acc)
scala> val cols2 = df2.columns.toSet
cols2: scala.collection.immutable.Set[String] = Set(id, name, has_email_acc)
scala> val total = cols1 ++ cols2
total: scala.collection.immutable.Set[String] = Set(id, name, has_bank_acc, has_email_acc)
scala> def expr(myCols: Set[String], allCols: Set[String]) = {
| allCols.toList.map(x => x match {
| case x if myCols.contains(x) => col(x)
| case _ => lit(null).as(x)
| })
| }
expr: (myCols: Set[String], allCols: Set[String])List[org.apache.spark.sql.Column]
scala> df1.select(expr(cols1, total): _*).unionAll(df2.select(expr(cols2,total): _*)).show
warning: there was one deprecation warning; re-run with -deprecation for details
+---+----+------------+-------------+
| id|name|has_bank_acc|has_email_acc|
+---+----+------------+-------------+
| 0| qwe| true| null|
| 1| asd| false| null|
| 2| rty| false| null|
| 3| tyu| true| null|
| 0| qwe| null| true|
| 5| hjk| null| false|
| 8| oiu| null| false|
| 7| nmb| null| true|
+---+----+------------+-------------+
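The core of this approach is the `expr` helper: for every column in the combined schema, it either selects the existing column or substitutes a null aliased to the missing name. The same set logic can be sketched in plain Scala without a Spark session (the `Projection`, `Keep`, and `NullAs` types below are hypothetical stand-ins for Spark's `Column`, `col(x)`, and `lit(null).as(x)`):

```scala
object ColumnAlign {
  // Stand-ins for Spark Column expressions, for illustration only
  sealed trait Projection
  case class Keep(name: String) extends Projection   // plays the role of col(name)
  case class NullAs(name: String) extends Projection // plays the role of lit(null).as(name)

  // Mirrors the expr(myCols, allCols) helper from the answer above:
  // keep the columns this dataframe has, null-pad the ones it lacks
  def expr(myCols: Set[String], allCols: List[String]): List[Projection] =
    allCols.map {
      case x if myCols.contains(x) => Keep(x)
      case x                       => NullAs(x)
    }

  def main(args: Array[String]): Unit = {
    val cols1 = Set("id", "name", "has_bank_acc")
    val cols2 = Set("id", "name", "has_email_acc")
    // Fix an explicit order so both select lists line up column-for-column
    val total = (cols1 ++ cols2).toList.sorted

    println(expr(cols1, total)) // has_email_acc becomes NullAs for df1
    println(expr(cols2, total)) // has_bank_acc becomes NullAs for df2
  }
}
```

Once both select lists share the same names in the same order, the union is trivially valid, which is exactly why the `df1.select(...).unionAll(df2.select(...))` call above succeeds.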
Let me know if this helps.

Adding the missing columns and then using unionByName helps:
val data = Seq((0,"qwe","true"),(1,"asd","false"),(2,"rty","false"),(3,"tyu","true")).toDF("id","name","has_bank_acc")
scala> data.show
+---+----+------------+
| id|name|has_bank_acc|
+---+----+------------+
| 0| qwe| true|
| 1| asd| false|
| 2| rty| false|
| 3| tyu| true|
+---+----+------------+
val data2 = Seq((0,"qwe","true"),(5,"hjk","false"),(8,"oiu","false"),(7,"nmb","true")).toDF("id","name","has_email_acc")
scala> data2.show
+---+----+-------------+
| id|name|has_email_acc|
+---+----+-------------+
| 0| qwe| true|
| 5| hjk| false|
| 8| oiu| false|
| 7| nmb| true|
+---+----+-------------+
val data_cols = data.columns
val data2_cols = data2.columns
val transformedData = data2_cols.diff(data_cols).foldLeft(data) {
  case (df, newCol) =>
    // lit(null) produces a real SQL null; lit("null") would insert the literal string "null"
    df.withColumn(newCol, lit(null))
}
val transformedData2 = data_cols.diff(data2_cols).foldLeft(data2) {
  case (df, newCol) =>
    df.withColumn(newCol, lit(null))
}
val finalData = transformedData2.unionByName(transformedData)
scala> finalData.show
+---+----+-------------+------------+
| id|name|has_email_acc|has_bank_acc|
+---+----+-------------+------------+
| 0| qwe| true| null|
| 5| hjk| false| null|
| 8| oiu| false| null|
| 7| nmb| true| null|
| 0| qwe| null| true|
| 1| asd| null| false|
| 2| rty| null| false|
| 3| tyu| null| true|
+---+----+-------------+------------+
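The `foldLeft` over the column-name difference is doing one simple thing: adding one null column per name the other dataframe has and this one lacks. Modeling a dataframe as a map from column name to a list of values, the same padding step can be sketched in plain Scala (the `Frame` type and `padMissing` helper are hypothetical, for illustration only):

```scala
object PadColumns {
  // Toy model of a dataframe: column name -> column values
  type Frame = Map[String, List[String]]

  // Mirrors cols.diff(otherCols).foldLeft(df)((d, c) => d.withColumn(c, lit(null)))
  def padMissing(df: Frame, allCols: Seq[String]): Frame = {
    val nRows = df.values.headOption.map(_.length).getOrElse(0)
    allCols.diff(df.keys.toSeq).foldLeft(df) { (acc, missing) =>
      // One null per existing row, so every column stays the same length
      acc + (missing -> List.fill(nRows)(null: String))
    }
  }

  def main(args: Array[String]): Unit = {
    val data: Frame = Map(
      "id"           -> List("0", "1"),
      "has_bank_acc" -> List("true", "false")
    )
    val padded = padMissing(data, Seq("id", "has_bank_acc", "has_email_acc"))
    println(padded("has_email_acc")) // a null entry per row
  }
}
```

After both sides are padded this way, their column sets are identical, which is what lets `unionByName` match columns by name regardless of their order.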
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.BooleanType

dataframe1
  .withColumn("has_email_acc", lit(null).cast(BooleanType))
  .unionByName(dataframe2.withColumn("has_bank_acc", lit(null).cast(BooleanType)))

(In Spark 3.1+, `unionByName(other, allowMissingColumns = true)` can add the missing columns for you.)
What error did you get when performing the union and the join?