Spark Scala: column-level mismatches between two DataFrames


I have two DataFrames.

I want to find the differences at column level. The output should be:

id,value1_df1,value1_df2,diff_value1,value2_df1,value2_df2,diff_value2
1, 1        ,1           ,  0         , 6         ,6         ,0
2, 10       ,5           ,  5         , 8         ,4         ,4
3, 6        ,3           ,  3         , 4         ,1         ,3
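Likewise, I have about 100 columns and want to compute the difference between the same-named columns of the two DataFrames. The columns are dynamic.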

Maybe this will help:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.types.IntegerType

  val spark = SparkSession.builder.appName("Test").master("local[*]").getOrCreate()

  import spark.implicits._

  // Sample data: the values arrive as strings
  var df1 = Seq((1, "1", "6"), (2, "10", "8"), (3, "6", "4")).toDF("id", "value1", "value2")
  var df2 = Seq((1, "1", "6"), (2, "5", "4"), (3, "3", "1")).toDF("id", "value1", "value2")

  // Cast every column (including id) to an integer so the values can be subtracted
  df1.columns.foreach(column => {
    df1 = df1.withColumn(column, df1.col(column).cast(IntegerType))
  })
  df2.columns.foreach(column => {
    df2 = df2.withColumn(column, df2.col(column).cast(IntegerType))
  })

  // Give the id columns distinct names so they stay unambiguous after the join
  df1 = df1.withColumnRenamed("id", "df1_id")
  df2 = df2.withColumnRenamed("id", "df2_id")

  df1.show()
  df2.show()
At this point you have two DataFrames with the columns value_x, value_y, value_z, and so on.

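df1:

+------+------+------+
|df1_id|value1|value2|
+------+------+------+
|     1|     1|     6|
|     2|    10|     8|
|     3|     6|     4|
+------+------+------+

df2:

+------+------+------+
|df2_id|value1|value2|
+------+------+------+
|     1|     1|     6|
|     2|     5|     4|
|     3|     3|     1|
+------+------+------+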

Now we join them on the id:

  var df3 = df1.alias("df1").join(df2.alias("df2"), $"df1.df1_id" === $"df2.df2_id")
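Finally, we take every column of df1/df2 except the id (columns.tail skips the first column, which is the id here) and create a new difference column for each one. Important: both DataFrames must have the same columns.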

  df1.columns.tail.foreach(col => {
    val new_col_name = s"${col}-diff"
    val df_a_col = s"df1.${col}"
    val df_b_col = s"df2.${col}"
    df3 = df3.withColumn(new_col_name, df3.col(df_a_col) - df3.col(df_b_col))
  })

  df3.show()
Result:

+------+------+------+------+------+------+-----------+-----------+
|df1_id|value1|value2|df2_id|value1|value2|value1-diff|value2-diff|
+------+------+------+------+------+------+-----------+-----------+
|     1|     1|     6|     1|     1|     6|          0|          0|
|     2|    10|     8|     2|     5|     4|          5|          4|
|     3|     6|     4|     3|     3|     1|          3|          3|
+------+------+------+------+------+------+-----------+-----------+

That is the result. It is dynamic, so you can add as many valueX columns as you need.
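The same idea can also be written without reassigning a var, and with the column layout asked for in the question (id, value1_df1, value1_df2, diff_value1, ...). The following is only a sketch under some assumptions: the object name ColumnDiff, the helper columnDiff, and its key parameter are made up for illustration, both DataFrames are assumed to share the key column and the same numeric value columns, and column names are assumed not to contain dots or other characters that would need backtick quoting.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ColumnDiff {
  // Hypothetical helper: joins left and right on `key` and, for every other
  // column of `left`, emits <col>_df1, <col>_df2 and diff_<col>.
  def columnDiff(left: DataFrame, right: DataFrame, key: String = "id"): DataFrame = {
    val valueCols = left.columns.filterNot(_ == key).toSeq
    val l = left.alias("l")
    val r = right.alias("r")

    // Build the whole projection up front instead of calling withColumn in a loop
    val selected = col(s"l.$key").as(key) +: valueCols.flatMap { c =>
      Seq(
        col(s"l.$c").as(s"${c}_df1"),
        col(s"r.$c").as(s"${c}_df2"),
        (col(s"l.$c") - col(s"r.$c")).as(s"diff_$c")
      )
    }

    l.join(r, col(s"l.$key") === col(s"r.$key")).select(selected: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ColumnDiff").master("local[*]").getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, 1, 6), (2, 10, 8), (3, 6, 4)).toDF("id", "value1", "value2")
    val df2 = Seq((1, 1, 6), (2, 5, 4), (3, 3, 1)).toDF("id", "value1", "value2")

    // orderBy makes the row order deterministic for display
    columnDiff(df1, df2).orderBy("id").show()
    spark.stop()
  }
}
```

Building one select instead of chaining withColumn keeps the query plan smaller, which may matter with around 100 columns. It also avoids ending up with two columns that are both called value1 in the result.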

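First, convert the strings to integers. That requires a mapping from the English number words in the string values to actual numbers, e.g. six = 6. Once everything is an integer it is easy: join the two DataFrames by id and use the .withColumn method to create the new columns, each one the subtraction of the two matching columns.

The columns are dynamic and depend on the input data.

Do the two DataFrames have the same number of columns and the same column names?

Yes, the names are the same. It works fine. The one other thing I added on my side is sorting.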
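The word-to-number mapping mentioned in the first comment could look roughly like the sketch below. It is plain Scala with no Spark dependency; the object NumberWords, the function toInt, and the zero-to-ten vocabulary are all assumptions made for illustration, not something given in the thread.

```scala
import scala.util.Try

object NumberWords {
  // Assumed vocabulary: only "zero" to "ten" are covered; extend as needed.
  private val words: Map[String, Int] = Seq(
    "zero", "one", "two", "three", "four", "five",
    "six", "seven", "eight", "nine", "ten"
  ).zipWithIndex.toMap

  // Accepts either digits ("6") or a known word ("six").
  // Returns None when the input is neither.
  def toInt(raw: String): Option[Int] = {
    val s = raw.trim.toLowerCase
    Try(s.toInt).toOption.orElse(words.get(s))
  }
}
```

A function like this could then be wrapped in a Spark UDF and applied to each value column before the subtraction. Inputs it cannot parse would become null rather than failing the job.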