How to subtract two DataFrames keeping duplicates in Spark 2.3.0

Spark 2.4.0 introduced the convenient function exceptAll, which subtracts one DataFrame from another while keeping duplicate rows.

Example:

  val df1 = Seq(
    ("a", 1L),
    ("a", 1L),
    ("a", 1L),
    ("b", 2L)
  ).toDF("id", "value")
  val df2 = Seq(
    ("a", 1L),
    ("b", 2L)
  ).toDF("id", "value")

df1.exceptAll(df2).collect()
// returns: Seq(("a", 1L), ("a", 1L))
However, I am limited to Spark 2.3.0, and its except performs a set difference that drops duplicates (on this example, df1.except(df2) would return an empty result).


What is the best way to implement this behavior using only the functions available in Spark 2.3.0?

One option is to use row_number to generate a sequence-number column that numbers the duplicates of each row, and then use that column in a left join to pick out the rows of df1 that have no counterpart in df2.

A PySpark solution follows.
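Since the question's example is in Scala, here is a minimal sketch that first recreates df1 and df2 as PySpark DataFrames; the SparkSession setup is an assumption:

 from pyspark.sql import SparkSession

 # Assumed setup: recreate the question's example data as PySpark DataFrames.
 spark = SparkSession.builder.getOrCreate()
 df1 = spark.createDataFrame([("a", 1), ("a", 1), ("a", 1), ("b", 2)], ["id", "value"])
 df2 = spark.createDataFrame([("a", 1), ("b", 2)], ["id", "value"])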

 from pyspark.sql import Window
 from pyspark.sql.functions import row_number

 # Number the duplicates of each distinct row 1..n. The partition must cover
 # every column; partitioning by id alone would give rows with the same id but
 # different values a shared numbering, and the join below would mismatch them.
 w1 = Window.partitionBy(df1.id, df1.value).orderBy(df1.value)
 w2 = Window.partitionBy(df2.id, df2.value).orderBy(df2.value)
 df1 = df1.withColumn("rnum", row_number().over(w1))
 df2 = df2.withColumn("rnum", row_number().over(w2))

 # Left join on the full row plus its duplicate number; df1 rows with no
 # partner in df2 (df2.id is null) are exactly what exceptAll would keep.
 res_like_exceptAll = df1.join(df2,
                               (df1.id == df2.id) &
                               (df1.value == df2.value) &
                               (df1.rnum == df2.rnum),
                               'left') \
                         .filter(df2.id.isNull()) \
                         .select(df1.id, df1.value)
 res_like_exceptAll.show()
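On the example data above, this should print the same leftover rows that exceptAll returns in Spark 2.4 (row order may vary):

 +---+-----+
 | id|value|
 +---+-----+
 |  a|    1|
 |  a|    1|
 +---+-----+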