Equality of two DataFrames

Tags: python, scala, apache-spark, databricks

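I have the following situation:

I have two DataFrames, each containing a single column:

DF1=(1,2,3,4,5)
DF2=(3,6,7,8,9,10)
These values are keys. I write DF1 out as a Parquet file only if the keys in DF1 are not present in DF2 (for the example above, the check should return false). This is how I currently implement it: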

val df1count= DF1.count
val df2count=DF2.count
val diffDF=DF2.except(DF1)
val diffCount=diffDF.count
if(diffCount==(df2count-df1count)) true
else false

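The problem with this approach is that it calls an action 4 times, which is surely not the best way. Can anyone suggest the most efficient way to achieve this?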

You can use the following function:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def diff(key: String, df1: DataFrame, df2: DataFrame): DataFrame = {
  val fields = df1.schema.fields.map(_.name)
  val diffColumnName = "Diff"

  df1
    .join(df2, df1(key) === df2(key), "full_outer")
    .withColumn(
      diffColumnName,
      when(df1(key).isNull, "New row in DataFrame 2")
        .otherwise(
          when(df2(key).isNull, "New row in DataFrame 1")
            .otherwise(
              concat_ws("",
                fields.map(f => when(df1(f) =!= df2(f), s"$f ").otherwise("")):_*
              )
            )
        )
    )
    .filter(col(diffColumnName) =!= "")
    .select(
      fields.map(f =>
        when(df1(key).isNotNull, df1(f)).otherwise(df2(f)).alias(f)
      ) :+ col(diffColumnName):_*
    )
}
In your case, run the following:

diff("emp_id", df1, df2)
Example:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object DiffDataFrames extends App {
  val session = SparkSession.builder().master("local").getOrCreate()

  import session.implicits._

  val df1 = session.createDataset(Seq((1,"a",11),(2,"b",2),(3,"c",33),(5,"e",5))).toDF("n", "s", "i")
  val df2 = session.createDataset(Seq((1,"a",11),(2,"bb",2),(3,"cc",34),(4,"d",4))).toDF("n", "s", "i")

  def diff(key: String, df1: DataFrame, df2: DataFrame): DataFrame =
  /* above definition */

  diff("n", df1, df2).show(false)
}

Here is a way to get the rows that are not common to the two DataFrames:

val d1 = Seq((3, "Chennai", "rahman", "9848022330", 45000, "SanRamon"), (1, "Hyderabad", "ram", "9848022338", 50000, "SF"), (2, "Hyderabad", "robin", "9848022339", 40000, "LA"), (4, "sanjose", "romin", "9848022331", 45123, "SanRamon"))
val d2 = Seq((3, "Chennai", "rahman", "9848022330", 45000, "SanRamon"), (1, "Hyderabad", "ram", "9848022338", 50000, "SF"), (2, "Hyderabad", "robin", "9848022339", 40000, "LA"), (4, "sanjose", "romin", "9848022331", 45123, "SanRamon"), (4, "sanjose", "romino", "9848022331", 45123, "SanRamon"), (5, "LA", "Test", "1234567890", 12345, "Testuser"))

val df1 = d1.toDF("emp_id" ,"emp_city" ,"emp_name" ,"emp_phone" ,"emp_sal" ,"emp_site")
val df2 = d2.toDF("emp_id" ,"emp_city" ,"emp_name" ,"emp_phone" ,"emp_sal" ,"emp_site")

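// Register both DataFrames as temporary views so the SQL below can refer to them by name
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

// spark is the SparkSession
spark.sql("((select * from df1) union (select * from df2)) minus ((select * from df1) intersect (select * from df2))").show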
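
For the single-column key check in the original question, the following is a minimal sketch that answers with one action instead of several counts. It is not taken from the answers above: the object name `KeyOverlapCheck`, the helper names, and the column name `key` are hypothetical. Because the question can be read two ways, both readings are shown: "do the two DataFrames share any key?" and "is every key of DF1 present in DF2?". Rows with a null key never match in a join, so the sketch assumes keys are non-null.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object KeyOverlapCheck {
  // True when at least one value of `key` in `left` also occurs in `right`.
  // A left-semi join keeps only rows of `left` that have a match in `right`;
  // head(1) is the only action, and Spark can stop after the first row.
  def sharesAnyKey(left: DataFrame, right: DataFrame, key: String): Boolean =
    left.join(right, Seq(key), "left_semi").head(1).nonEmpty

  // True when every value of `key` in `left` also occurs in `right`.
  // A left-anti join keeps only rows of `left` that have no match in `right`.
  def allKeysPresent(left: DataFrame, right: DataFrame, key: String): Boolean =
    left.join(right, Seq(key), "left_anti").head(1).isEmpty

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("KeyOverlapCheck")
      .getOrCreate()
    import spark.implicits._

    // Data from the question; the column name "key" is an assumption.
    val df1 = Seq(1, 2, 3, 4, 5).toDF("key")
    val df2 = Seq(3, 6, 7, 8, 9, 10).toDF("key")

    println(sharesAnyKey(df1, df2, "key"))   // true: the key 3 is in both
    println(allKeysPresent(df1, df2, "key")) // false: 1, 2, 4, 5 are missing from DF2

    spark.stop()
  }
}
```

Under the first reading, the Parquet file would be written when `sharesAnyKey` returns false; under the second, the result of `allKeysPresent` corresponds to the boolean that the count-based code in the question computes when keys are distinct.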

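Comments:

Can you tell me how to declare df1 and df2? I declared them as below: sqlContext = SQLContext(sc) df = sqlContext.sql("select * from table1") df2 = sqlContext.sql("select * from table2"), and then used the code above as is.... I am getting a syntax error.... I am very new to Spark Scala code; can you correct my mistake?

When I try to run the following code I get an error: not found: value df1, not found: df2.. import org.apache.spark.sql.{DataFrame, SQLContext} import org.apache.spark.sql.functions._ val sc: SparkContext val sqlContext = new org.apache.spark.sql.SQLContext(sc) sqlContext = SQLContext(sc) df1 = sqlContext.sql("select * from table1") df2 = sqlContext.sql("select * from table2") diff("tenant", df1, df2) def diff(key: String, df1: DataFrame, df2: DataFrame): DataFrame = {……} // code provided for the diff function

Hi, I added a short example.

How do I join df1 and df2 on multiple columns when the DataFrames have no single key column? How can the diff function above be updated to handle multiple keys?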
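
On the multiple-key question, one possible adaptation of the `diff` function is sketched below. It is not the answer author's code: the object name `MultiKeyDiff` and the `keys: Seq[String]` parameter are hypothetical. The join condition requires every key column to match, and the sketch assumes that the first key column is never null in real rows of either DataFrame, so a null there after the full outer join means the row is missing on that side.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

object MultiKeyDiff {
  def diff(keys: Seq[String], df1: DataFrame, df2: DataFrame): DataFrame = {
    require(keys.nonEmpty, "at least one key column is required")

    val fields = df1.schema.fields.map(_.name)
    val diffColumnName = "Diff"

    // Rows correspond only when all key columns are equal.
    val joinCondition: Column = keys.map(k => df1(k) === df2(k)).reduce(_ && _)

    // Assumption: the first key column is non-null in every input row,
    // so after the full outer join a null marks the missing side.
    val missingInDf1 = df1(keys.head).isNull
    val missingInDf2 = df2(keys.head).isNull

    df1
      .join(df2, joinCondition, "full_outer")
      .withColumn(
        diffColumnName,
        when(missingInDf1, "New row in DataFrame 2")
          .when(missingInDf2, "New row in DataFrame 1")
          .otherwise(
            // List the names of the columns whose values differ.
            concat_ws("",
              fields.map(f => when(df1(f) =!= df2(f), s"$f ").otherwise("")): _*
            )
          )
      )
      .filter(col(diffColumnName) =!= "")
      .select(
        fields.map(f =>
          when(!missingInDf1, df1(f)).otherwise(df2(f)).alias(f)
        ) :+ col(diffColumnName): _*
      )
  }
}
```

With the `df1` and `df2` from the example above, it could be called as `MultiKeyDiff.diff(Seq("n", "s"), df1, df2).show(false)`. As in the single-key version, both DataFrames are assumed to have the same column names.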