Scala: How to compare the columns of two tables using Spark?

Tags: scala, apache-spark, hadoop, apache-spark-sql

I am trying to compare two tables by reading them into DataFrames. For each column common to the two tables, I concatenate the primary key (e.g. order_id) with that column (e.g. order_date, order_name, order_event).

The Scala code I am using:

val primaryKey = "order_id"
for (i <- commonColumnsList) {
  val column_name = i
  val tempDataFrameForNew = newDataFrame.selectExpr(s"concat($primaryKey, $i) as concatenated")
  val tempDataFrameOld = oldDataFrame.selectExpr(s"concat($primaryKey, $i) as concatenated")

  // Get those records which are common to both old/new tables
  val commonRecords = tempDataFrameForNew.intersect(tempDataFrameOld)
  matchCountCalculated = commonRecords.count()
  // Get those records which aren't common to both old/new tables
  nonMatchCountCalculated = tempDataFrameOld.union(tempDataFrameForNew).except(commonRecords).count()

  // Total null/non-null counts in both old and new tables
  nullsCountInNewDataFrame = newDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
  nullsCountInOldDataFrame = oldDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
  nonNullsCountInNewDataFrame = newDFCount - nullsCountInNewDataFrame
  nonNullsCountInOldDataFrame = oldDFCount - nullsCountInOldDataFrame

  // Collect the result for the current column in a Seq; it is converted to a DataFrame later
  tempSeq = tempSeq :+ Row(column_name, matchCountCalculated.toString, nonMatchCountCalculated.toString,
    (nullsCountInNewDataFrame - nullsCountInOldDataFrame).toString,
    (nonNullsCountInNewDataFrame - nonNullsCountInOldDataFrame).toString)
}
// Final step: create a DataFrame from the Seq and a schema
spark.createDataFrame(spark.sparkContext.parallelize(tempSeq), schema)
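
The snippet above references several values whose declarations are not shown (tempSeq, newDFCount, oldDFCount, schema). A minimal sketch of what they might look like; the schema field names are illustrative assumptions:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    var tempSeq = Seq[Row]()                     // accumulates one summary Row per compared column
    val newDFCount = newDataFrame.count().toInt  // total rows in the new table
    val oldDFCount = oldDataFrame.count().toInt  // total rows in the old table

    // Schema for the final summary DataFrame; these field names are assumptions
    val schema = StructType(Seq(
      StructField("column_name", StringType),
      StructField("match_count", StringType),
      StructField("non_match_count", StringType),
      StructField("nulls_count_diff", StringType),
      StructField("non_nulls_count_diff", StringType)
    ))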

You can do the following:

1. Outer join the old and new DataFrames on the primary key:
   joined_df = df_old.join(df_new, primary_key, "outer")
2. Cache it if you can; this will save you a lot of time.
3. Now you can iterate over the columns and compare them using Spark functions (.isNull for a mismatch, === for a match, and so on).
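
The pseudocode attached to step 3 was garbled in extraction (only a "for (col" fragment survives). A hedged reconstruction of the idea, assuming primary_key holds the key column name and commonColumnsList holds the shared column names:

    import org.apache.spark.sql.functions.{col, when}

    // Join once on the key; aliases let us reference each side's copy of a column
    val joined_df = df_old.alias("old").join(df_new.alias("new"), Seq(primary_key), "outer").cache()

    for (c <- commonColumnsList if c != primary_key) {
      joined_df.select(
        col(primary_key),
        when(col(s"old.$c") === col(s"new.$c"), "match")                   // values agree
          .when(col(s"old.$c").isNull || col(s"new.$c").isNull, "missing") // null or absent on one side
          .otherwise("mismatch")                                           // values differ
          .alias(s"${c}_status")
      ).show()
    }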


Hey Paul, thanks for your answer, but I am a bit confused; maybe my question wasn't framed properly. If I do an outer join, won't the columns of DF1 and DF2 match on nothing except the primary key? In my old/new DataFrames I have the same column names, and I want to find the differences between the columns that exist in both DataFrames.

No, I mean an outer join on the primary key column:
val joined_df = df_new.join(df_old, primary_key, "outer")

Sorry, my mistake. OK: do a full outer join on the primary key and then iterate as … please correct my understanding.

I added some pseudocode for the iteration. The idea is still to iterate only over the columns (excluding the primary key column); there is no need to concatenate the key with the columns.
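
Tying the thread together, one way the per-column match/mismatch counts from the original question could fall out of that single join, with no key concatenation. This is a sketch, with df_old, df_new, primary_key, and commonColumnsList assumed as above:

    import org.apache.spark.sql.functions.col
    import spark.implicits._

    val joined_df = df_old.alias("o").join(df_new.alias("n"), Seq(primary_key), "outer").cache()

    // One (matches, mismatches) pair per shared column, skipping the key itself
    val summary = commonColumnsList.filter(_ != primary_key).map { c =>
      val matches    = joined_df.filter(col(s"o.$c") === col(s"n.$c")).count()
      // <=> is null-safe equality, so nulls and rows absent on one side count as mismatches
      val mismatches = joined_df.filter(!(col(s"o.$c") <=> col(s"n.$c"))).count()
      (c, matches, mismatches)
    }

    summary.toDF("column_name", "match_count", "mismatch_count").show()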