Apache Spark: how to compute the values of two DataFrames in Spark with Scala


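I have two DataFrames with the same row count, and I want the sum of each pair of corresponding values in the two DataFrames. This is the input: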

+---+  and       +---+
|df1|            |df2|
+---+            +---+
| 11|            |  1|
| 12|            |  2|
| 13|            |  3|
| 14|            |  4|
| 15|            |  5|
| 16|            |  6|
| 17|            |  7|
| 18|            |  8|
| 19|            |  9|
| 20|            | 10|
+---+            +---+

This is my code:

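// toDF needs the SQL implicits in scope, e.g. import sqlContext.implicits._
val df1 = sc.parallelize(1 to 10, 2).toDF("df1")
val df2 = sc.parallelize(11 to 20, 2).toDF("df2")
// zip pairs rows by position; both RDDs must have the same number of
// partitions and the same number of elements in each partition.
val df3 = df1.rdd.zip(df2.rdd)
  .map { case (a, b) => a.getInt(0) + b.getInt(0) }
  .toDF("result")
df3.show()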

The result is:

+------+
|result|
+------+
|    12|
|    14|
|    16|
|    18|
|    20|
|    22|
|    24|
|    26|
|    28|
|    30|
+------+

I have to convert the DataFrames to RDDs and then zip the two RDDs. How can I compute over the two DataFrames without converting to RDDs?

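You can simply use a Window function to create a row_number on each side and use it to join the two DataFrames. After the join, just sum the two columns.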

import org.apache.spark.sql.expressions.Window
import sqlContext.implicits._
import org.apache.spark.sql.functions._

val df1 = sc.parallelize(1 to 10,2).toDF("df1")
val df2 = sc.parallelize(11 to 20,2).toDF("df2")

df1.withColumn("rowNo", row_number() over Window.orderBy("df1"))
  .join(df2.withColumn("rowNo", row_number() over Window.orderBy("df2")), Seq("rowNo"))
  .select(($"df1"+$"df2").alias("result"))
  .show(false)
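One caveat with the row-number approach is worth spelling out: row_number() over Window.orderBy("df1") numbers rows by sorted value, so the join pairs the k-th smallest value of one column with the k-th smallest of the other. That matches the positional pairing of rdd.zip only when both columns are already sorted, as they are in this question. The sketch below imitates both pairings with plain Scala collections instead of Spark; the object and method names are hypothetical and exist only for illustration.

```scala
object RankPairingSketch {
  // Pairs values by their original position, like rdd.zip in the question.
  def positionalSums(a: Seq[Int], b: Seq[Int]): Seq[Int] =
    a.zip(b).map { case (x, y) => x + y }

  // Pairs the k-th smallest of a with the k-th smallest of b, which is what
  // numbering rows by sorted value and joining on that number amounts to.
  def rankSums(a: Seq[Int], b: Seq[Int]): Seq[Int] =
    a.sorted.zip(b.sorted).map { case (x, y) => x + y }

  def main(args: Array[String]): Unit = {
    val a = Seq(3, 1, 2)
    val b = Seq(10, 20, 30)
    println(positionalSums(a, b)) // List(13, 21, 32)
    println(rankSums(a, b))       // List(11, 22, 33)
  }
}
```

For the sorted input in the question the two pairings agree, so the answer above produces the expected sums.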

You can use monotonically_increasing_id to give each DataFrame an id, join on that id, and add the two columns:

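import spark.implicits._
import org.apache.spark.sql.functions.monotonically_increasing_id

val df1 = spark.sparkContext.parallelize(11 to 20).toDF("df1")
val df2 = spark.sparkContext.parallelize(1 to 10).toDF("df2")

// The ids match row for row only if both DataFrames have the same number of
// partitions and the same number of rows in each partition.
df1.withColumn("id", monotonically_increasing_id())
  .join(df2.withColumn("id", monotonically_increasing_id()), "id")
  .withColumn("result", $"df1" + $"df2").drop("id").show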

Output:

+---+---+------+
|df1|df2|result|
+---+---+------+
| 11|  1|    12|
| 18|  8|    26|
| 17|  7|    24|
| 20| 10|    30|
| 16|  6|    22|
| 12|  2|    14|
| 14|  4|    18|
| 19|  9|    28|
| 13|  3|    16|
| 15|  5|    20|
+---+---+------+
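Why the ids line up here, and when they would not: the Spark documentation describes the current implementation of monotonically_increasing_id as putting the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The following is a minimal sketch of that layout in plain Scala, not Spark's actual code; simulatedIds and the per-partition row counts are hypothetical, chosen only to show what happens when two 10-row datasets are partitioned differently.

```scala
object IdLayoutSketch {
  // Hypothetical helper: the ids a dataset would receive, given how many rows
  // sit in each partition. Partition index goes in the upper bits, the row's
  // position within its partition in the lower 33 bits.
  def simulatedIds(rowsPerPartition: Seq[Int]): Seq[Long] =
    rowsPerPartition.zipWithIndex.flatMap { case (rows, partition) =>
      (0 until rows).map(offset => (partition.toLong << 33) + offset)
    }

  def main(args: Array[String]): Unit = {
    val even   = simulatedIds(Seq(5, 5)) // 10 rows split 5 + 5
    val skewed = simulatedIds(Seq(3, 7)) // 10 rows split 3 + 7
    // Only ids present on both sides would survive an inner join on id.
    println(even.intersect(skewed).size) // 8
  }
}
```

With identical partition layouts every id has a partner and no rows are lost; with the skewed layout above, two of the ten rows would silently drop out of an inner join.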

Hope this helps.

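Is there a more efficient way? A window changes the partitioning, so if I use a window I have to repartition when computing.

Yes, a window function will definitely change the partitioning. But without gathering all the data into one partition, how would you make sure that the 11 from df2 and the 1 from df1 end up on the same executor? Unless the matching rows are on the same executor, we cannot join the two DataFrames. That is what your requirement implies. I hope this helps!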