Apache Spark: dropDuplicates and filtering the duplicates into another DataFrame


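I have a DataFrame with 100K records. I want to drop duplicate records based on one column and then filter the dropped records into a different DataFrame. For this I am using the following logic: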

val df = sqlContext.read.json("/hdfs/demo/bulk/file1.json")
// the rank column contains some duplicate values; the id column is unique
val uniqueDF = df.dropDuplicates(Seq("rank"))
uniqueDF.write.format("com.databricks.spark.csv").option("header", "true")
      .option("delimiter", "|").save("/hdfs/records/unique/")
// then filtering the dropped records
val dupeDF = df.except(uniqueDF)

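But this dupeDF contains some duplicate id values, whereas my df contains only unique id values. So except gives me incorrect results when I run this code on a YARN cluster, although it works correctly in standalone mode. I am using Spark 1.6.2.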
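
One likely cause, offered as a guess rather than a confirmed diagnosis: dropDuplicates keeps an arbitrary row for each rank, and because DataFrames are evaluated lazily, uniqueDF is computed once for the write and again for except. On a multi-node cluster the two evaluations may keep different rows, so the two results do not line up. The sketch below makes the choice deterministic by always keeping the row with the smallest id per rank. The helper name splitByRank and the column name keep_id are hypothetical, and it assumes that id is unique and orderable and that rank is never null (rows with a null rank would be lost in the join).

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.min

// Hypothetical helper: returns (uniqueDF, dupeDF) for the given DataFrame.
// Assumes columns "id" (unique, orderable) and "rank" (non-null).
def splitByRank(df: DataFrame): (DataFrame, DataFrame) = {
  // For every rank, choose the smallest id as the row to keep.
  val keepers = df.groupBy("rank").agg(min("id").as("keep_id"))

  // Attach the chosen id to every row that shares the same rank.
  val tagged = df.join(keepers, "rank")

  // The row whose id equals the chosen id is the one kept per rank.
  val uniqueDF = tagged.filter(tagged("id") === tagged("keep_id")).drop("keep_id")

  // keep_id is the minimum id, so every other row in the group has a larger id.
  val dupeDF = tagged.filter(tagged("id") > tagged("keep_id")).drop("keep_id")

  (uniqueDF, dupeDF)
}
```

It would be called as val (uniqueDF, dupeDF) = splitByRank(df). Both outputs are derived from the same deterministic rule, so every id lands in exactly one of them no matter how often the plan is re-evaluated. Persisting uniqueDF before the write and the except may also help with the original code, but that is not guaranteed if cached partitions are evicted and recomputed.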