Spark Scala: column-level mismatches between two DataFrames
I have two DataFrames and want to find the differences between them at column level. The output should be:
id,value1_df1,value1_df2,diff_value1,value2_df1,value2_df2,diff_value2
1,1,1,0,6,6,0
2,10,5,5,8,4,4
3,6,3,3,4,1,3
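Likewise, I have 100 columns and want to compute the difference between the same-named columns of the two DataFrames. The columns are dynamic.

Maybe this will help: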
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.IntegerType

val spark = SparkSession.builder.appName("Test").master("local[*]").getOrCreate()
import spark.implicits._

// Sample data: the values arrive as strings
var df1 = Seq((1, "1", "6"), (2, "10", "8"), (3, "6", "4")).toDF("id", "value1", "value2")
var df2 = Seq((1, "1", "6"), (2, "5", "4"), (3, "3", "1")).toDF("id", "value1", "value2")

// Cast every column (including id) to integer so the values can be subtracted
df1.columns.foreach(column => {
  df1 = df1.withColumn(column, df1.col(column).cast(IntegerType))
})
df2.columns.foreach(column => {
  df2 = df2.withColumn(column, df2.col(column).cast(IntegerType))
})

// Give the key a distinct name on each side so the join condition is unambiguous
df1 = df1.withColumnRenamed("id", "df1_id")
df2 = df2.withColumnRenamed("id", "df2_id")
df1.show()
df2.show()
At this point you have two DataFrames, each with an id column plus the integer columns value1 and value2. Continuing:
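df1:
+------+------+------+
|df1_id|value1|value2|
+------+------+------+
|     1|     1|     6|
|     2|    10|     8|
|     3|     6|     4|
+------+------+------+
df2:
+------+------+------+
|df2_id|value1|value2|
+------+------+------+
|     1|     1|     6|
|     2|     5|     4|
|     3|     3|     1|
+------+------+------+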
Now we join them on the id:
var df3 = df1.alias("df1").join(df2.alias("df2"), $"df1.df1_id" === $"df2.df2_id")
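Finally, we take every column of df1 except the id (it is important that df1 and df2 have the same columns) and create a new difference column for each one: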
df1.columns.tail.foreach(col => {
val new_col_name = s"${col}-diff"
val df_a_col = s"df1.${col}"
val df_b_col = s"df2.${col}"
df3 = df3.withColumn(new_col_name, df3.col(df_a_col) - df3.col(df_b_col))
})
df3.show()
Result:
+------+------+------+------+------+------+-----------+-----------+
|df1_id|value1|value2|df2_id|value1|value2|value1-diff|value2-diff|
+------+------+------+------+------+------+-----------+-----------+
|     1|     1|     6|     1|     1|     6|          0|          0|
|     2|    10|     8|     2|     5|     4|          5|          4|
|     3|     6|     4|     3|     3|     1|          3|          3|
+------+------+------+------+------+------+-----------+-----------+
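
With around 100 columns, as in the question, it may be worth building all the difference columns in a single select rather than calling withColumn in a loop. The Spark documentation notes that each withColumn call adds a projection to the plan, so many calls in a loop can produce very large plans. The following is only a sketch under assumptions: the object name ColumnDiff and the function diffByKey are hypothetical, both DataFrames are assumed to have the same column names, the value columns are assumed to be numeric already, and the key column keeps its original name (no df1_id/df2_id rename). The output naming follows the layout requested in the question.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

object ColumnDiff {
  // Hypothetical helper (not from the original answer).
  // Assumes df1 and df2 share the same column names and that all
  // non-key columns are numeric.
  def diffByKey(df1: DataFrame, df2: DataFrame, key: String): DataFrame = {
    // Every column except the join key gets a diff column
    val valueCols: Seq[String] = df1.columns.filter(_ != key).toSeq

    val a = df1.alias("a")
    val b = df2.alias("b")

    // For each value column emit: <c>_df1, <c>_df2, diff_<c>
    val projected: Seq[Column] = col(s"a.$key").as(key) +: valueCols.flatMap { c =>
      Seq(
        col(s"a.$c").as(s"${c}_df1"),
        col(s"b.$c").as(s"${c}_df2"),
        (col(s"a.$c") - col(s"b.$c")).as(s"diff_$c")
      )
    }

    // Inner join on the key, then one projection for all output columns
    a.join(b, col(s"a.$key") === col(s"b.$key")).select(projected: _*)
  }
}
```

Called as ColumnDiff.diffByKey(df1, df2, "id") on the two frames before the id rename, this should yield the columns id, value1_df1, value1_df2, diff_value1, value2_df1, value2_df2, diff_value2. Because the join is an inner join, ids present in only one DataFrame are dropped; a full outer join would be needed to keep them.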
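That is the result. It is dynamic, so you can add as many valueX columns as you need.

First, convert the strings to integers; if the strings hold English number words, this requires a mapping from the word to the actual number, e.g. six = 6. Once everything is an integer, the rest is easy: join the two DataFrames by id and use the .withColumn method to create the new columns, each holding the subtraction of the two corresponding columns.

Comments:
The columns are dynamic and depend on the input data.
Do the two DataFrames have the same number of columns and the same column names?
Yes, the names are the same.
It works fine. One more thing I added in that part is sorting.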
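
The last comment mentions adding a sort but does not show it. As a rough sketch of what that could look like, the helper below sorts by the key and also keeps only the rows where at least one difference column is non-zero, which matches the "mismatch" wording of the title. The names MismatchFilter and onlyMismatches are hypothetical, and treating a null difference as a mismatch is an assumption made here, not something stated in the thread.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

object MismatchFilter {
  // Hypothetical helper (not from the original thread).
  // diffCols: names of the difference columns, e.g. "value1-diff".
  // sortKey:  column to order the result by, e.g. "df1_id".
  def onlyMismatches(df: DataFrame, diffCols: Seq[String], sortKey: String): DataFrame = {
    // Backticks allow names containing '-' to be resolved as plain column names.
    // `<=>` is null-safe equality, so a null diff counts as a mismatch here
    // (an assumption; adjust if nulls should be ignored).
    val anyMismatch = diffCols
      .map(c => !(col(s"`$c`") <=> 0))
      .reduceOption(_ || _)
      .getOrElse(lit(false)) // no diff columns: nothing can mismatch

    df.filter(anyMismatch).orderBy(col(s"`$sortKey`"))
  }
}
```

For the df3 built above, a call might look like MismatchFilter.onlyMismatches(df3, Seq("value1-diff", "value2-diff"), "df1_id").show(). If only the sort is wanted, df3.orderBy("df1_id") is enough.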