Apache Spark: compare two datasets and get the changed fields
I am developing with Spark in Java. I download data from an API and compare it with the data in MongoDB; the downloaded JSON has 15-20 fields, but the database has 300 fields. My task is to compare the downloaded JSON with the MongoDB data and get any field whose value has changed from the past data.

Sample dataset:
Data downloaded from the API
MongoDB data

Because of the number of columns, I cannot use except.

Expected output:
Please find the source code below. Here I use a unique phone-number condition as an example.
import spark.implicits._  // already in scope in spark-shell; needed for toDF and the 'symbol column syntax

val list = List((1,"tony",123,"a@g.com"), (2,"stark",456,"b@g.com"),
  (3,"spidy",789,"c@g.com"))
val df1 = list.toDF("StudentId","Name","Phone","Email")
  .select('StudentId as "StudentId_1", 'Name as "Name_1", 'Phone as "Phone_1",
    'Email as "Email_1")
df1.show()
val list1 = List((1,"tony",1234,"a@g.com","NY","Nowhere"),
(2,"stark",456,"bg@g.com", "NY", "Nowhere"),
(3,"spidy",789,"c@g.com","OH","Nowhere"))
val df2 = list1.toDF("StudentId","Name","Phone","Email","State","City")
.select('StudentId as "StudentId_2", 'Name as "Name_2", 'Phone as "Phone_2",
'Email as "Email_2", 'State as "State_2", 'City as "City_2")
df2.show()
val df3 = df1.join(df2, df1("StudentId_1") ===
df2("StudentId_2")).where(df1("Phone_1") =!= df2("Phone_2"))
df3.withColumnRenamed("Phone_1", "Past_Phone").show()
+-----------+------+-------+-------+
|StudentId_1|Name_1|Phone_1|Email_1|
+-----------+------+-------+-------+
| 1| tony| 123|a@g.com|
| 2| stark| 456|b@g.com|
| 3| spidy| 789|c@g.com|
+-----------+------+-------+-------+
+-----------+------+-------+--------+-------+-------+
|StudentId_2|Name_2|Phone_2| Email_2|State_2| City_2|
+-----------+------+-------+--------+-------+-------+
| 1| tony| 1234| a@g.com| NY|Nowhere|
| 2| stark| 456|bg@g.com| NY|Nowhere|
| 3| spidy| 789| c@g.com| OH|Nowhere|
+-----------+------+-------+--------+-------+-------+
+-----------+------+----------+-------+-----------+------+-------+-------+-------+-------+
|StudentId_1|Name_1|Past_Phone|Email_1|StudentId_2|Name_2|Phone_2|Email_2|State_2| City_2|
+-----------+------+----------+-------+-----------+------+-------+-------+-------+-------+
| 1| tony| 123|a@g.com| 1| tony| 1234|a@g.com| NY|Nowhere|
+-----------+------+----------+-------+-----------+------+-------+-------+-------+-------+
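The join above works when the comparison is spelled out column by column, but the question mentions ~300 fields, so a hand-written condition per column does not scale. A minimal sketch of a dynamic variant (assumptions: both frames share a join key and comparable types for the shared columns; `withPastColumns` and `idCol` are illustrative names, not part of the answer above):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}

// For every column present in both frames (except the key), add a
// Past_<col> column holding the old value when it differs, else "".
def withPastColumns(api: DataFrame, mongo: DataFrame, idCol: String): DataFrame = {
  val shared = (api.columns.toSet intersect mongo.columns.toSet) - idCol
  val joined = api.alias("a").join(mongo.alias("m"), Seq(idCol))
  shared.foldLeft(joined) { (df, c) =>
    df.withColumn(s"Past_$c",
      when(col(s"a.$c") =!= col(s"m.$c"), col(s"m.$c")).otherwise(lit("")))
  }
}
```

This needs a running SparkSession; if the API and MongoDB schemas disagree on types, cast the shared columns first so the `=!=` comparison is meaningful.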
Assuming your data is in two dataframes, we can create temporary views for them as follows:
api_df.createOrReplaceTempView("api_data")
mongo_df.createOrReplaceTempView("mongo_data")
Next we can use Spark SQL. Here we join the two views on the StudentId column, then use a case statement over them to compute the past phone number and email:
spark.sql("""
select a.*
, case when a.Phone = b.Phone then '' else b.Phone end as Past_phone
, case when a.Email = b.Email then '' else b.Email end as Past_Email
from api_data a
join mongo_data b
on a.StudentId = b.StudentId
order by a.StudentId""").show()
Output:
+---------+-----+-----+-------+----------+----------+
|StudentId| Name|Phone| Email|Past_phone|Past_Email|
+---------+-----+-----+-------+----------+----------+
| 1| tony| 123|a@g.com| 1234| |
| 2|stark| 456|b@g.com| | bg@g.com|
| 3|spidy| 789|c@g.com| | |
+---------+-----+-----+-------+----------+----------+
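With 300 fields, the case expressions above can be generated instead of hand-written. A sketch against the same temp views, assuming the columns to compare are taken from the (smaller) API frame and that `api_df` exposes them via `.columns`:

```scala
// Build one "case when ..." projection per shared column, then run the query.
val idCol = "StudentId"
val compareCols = api_df.columns.filterNot(_ == idCol)
val caseExprs = compareCols.map { c =>
  s", case when a.$c = b.$c then '' else b.$c end as Past_$c"
}.mkString("\n")

spark.sql(s"""
  select a.*
  $caseExprs
  from api_data a
  join mongo_data b
    on a.$idCol = b.$idCol
  order by a.$idCol""").show()
```

Note that `a.$c = b.$c` evaluates to NULL when either side is NULL, which falls through to the else branch; use the null-safe `a.$c <=> b.$c` if NULLs should count as equal.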
We have:
df1.show
+-----------+------+-------+-------+
|StudentId_1|Name_1|Phone_1|Email_1|
+-----------+------+-------+-------+
| 1| tony| 123|a@g.com|
| 2| stark| 456|b@g.com|
| 3| spidy| 789|c@g.com|
+-----------+------+-------+-------+
df2.show
+-----------+------+-------+--------+-------+-------+
|StudentId_2|Name_2|Phone_2| Email_2|State_2| City_2|
+-----------+------+-------+--------+-------+-------+
| 1| tony| 1234| a@g.com| NY|Nowhere|
| 2| stark| 456|bg@g.com| NY|Nowhere|
| 3| spidy| 789| c@g.com| OH|Nowhere|
+-----------+------+-------+--------+-------+-------+
After the join:
var jn = df2.join(df1,df1("StudentId_1")===df2("StudentId_2"))
Then compute the past values, keeping the old one only when it differs:
var ans = jn
  .withColumn("Past_Phone",
    when(jn("Phone_2").notEqual(jn("Phone_1")), jn("Phone_1")).otherwise(""))
  .withColumn("Past_Email",
    when(jn("Email_2").notEqual(jn("Email_1")), jn("Email_1")).otherwise(""))
Next:
ans.select(ans("StudentId_2") as "StudentId", ans("Name_2") as "Name",
  ans("Phone_2") as "Phone", ans("Email_2") as "Email",
  ans("Past_Email"), ans("Past_Phone")).show
+---------+-----+-----+--------+----------+----------+
|StudentId| Name|Phone| Email|Past_Email|Past_Phone|
+---------+-----+-----+--------+----------+----------+
| 1| tony| 1234| a@g.com| | 123|
| 2|stark| 456|bg@g.com| b@g.com| |
| 3|spidy| 789| c@g.com| | |
+---------+-----+-----+--------+----------+----------+
Thanks for the quick answer, but the table I expect should not have State or City; it should show only the fields that changed, when they are non-empty. I don't know in advance which fields in df1 will change, so it has to be dynamic. — Okay. Do we have any specific columns that are always present?
df1.show
+-----------+------+-------+-------+
|StudentId_1|Name_1|Phone_1|Email_1|
+-----------+------+-------+-------+
| 1| tony| 123|a@g.com|
| 2| stark| 456|b@g.com|
| 3| spidy| 789|c@g.com|
+-----------+------+-------+-------+
df2.show
+-----------+------+-------+--------+-------+-------+
|StudentId_2|Name_2|Phone_2| Email_2|State_2| City_2|
+-----------+------+-------+--------+-------+-------+
| 1| tony| 1234| a@g.com| NY|Nowhere|
| 2| stark| 456|bg@g.com| NY|Nowhere|
| 3| spidy| 789| c@g.com| OH|Nowhere|
+-----------+------+-------+--------+-------+-------+
var jn = df2.join(df1,df1("StudentId_1")===df2("StudentId_2"))
var ans = jn
  .withColumn("Past_Phone",
    when(jn("Phone_2").notEqual(jn("Phone_1")), jn("Phone_1")).otherwise(""))
  .withColumn("Past_Email",
    when(jn("Email_2").notEqual(jn("Email_1")), jn("Email_1")).otherwise(""))
ans.select(ans("StudentId_2") as "StudentId", ans("Name_2") as "Name",
  ans("Phone_2") as "Phone", ans("Email_2") as "Email",
  ans("Past_Email"), ans("Past_Phone")).show
+---------+-----+-----+--------+----------+----------+
|StudentId| Name|Phone| Email|Past_Email|Past_Phone|
+---------+-----+-----+--------+----------+----------+
| 1| tony| 1234| a@g.com| | 123|
| 2|stark| 456|bg@g.com| b@g.com| |
| 3|spidy| 789| c@g.com| | |
+---------+-----+-----+--------+----------+----------+
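To get closer to the asker's request (show only what actually changed, without a fixed Past_* column per field), the per-column when expressions can be collected into one array and the null entries filtered out. A sketch against the `jn` frame above, assuming Spark 3.x for the `filter` higher-order function; the column list is hard-coded here but would be derived from the shared schema in practice:

```scala
import org.apache.spark.sql.functions.{array, col, concat, filter, lit, when}

val compared = Seq("Name", "Phone", "Email")
// One "Field: old -> new" entry per mismatch; null when the values match.
val diffs = array(compared.map { c =>
  when(col(s"${c}_2") =!= col(s"${c}_1"),
    concat(lit(s"$c: "), col(s"${c}_1").cast("string"),
           lit(" -> "), col(s"${c}_2").cast("string")))
}: _*)

jn.withColumn("changed_fields", filter(diffs, e => e.isNotNull))
  .select(col("StudentId_2") as "StudentId", col("changed_fields"))
  .show(false)
```

Each row then carries only its own list of changed fields, which stays readable even when the full schema has hundreds of columns.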