How do I compute the delta between two DataFrames in Scala?
I want to compute the delta between two tables (today's full snapshot and yesterday's full snapshot). I did a full outer join between df_current_full and df_previous_full on the key:
val df_currentFullTableExceptPreviousFullCurrentView: DataFrame = df_currentFullTable
.join(df_previousFullCurrentView, df_currentFullTable(key) ===
df_previousFullCurrentView(key), "full_outer")
To find out whether a row was deleted or created, I can simply do the following:
val df_currentFullTableExceptPreviousFullCurrentView: DataFrame = df_currentFullTable
.join(df_previousFullCurrentView, df_currentFullTable(key) === df_previousFullCurrentView(key), "full_outer")
.withColumn("flagCreatedDeleted", UDF_udfCreateFlagCreatedDeleted(df_currentFullTable(key),
df_previousFullCurrentView(key)))
import org.apache.spark.sql.functions.udf
// "S" = deleted (key only in previous), "C" = created (key only in current), null = key on both sides.
def udfCreateFlagCreatedDeleted(df_currentFullTable_key: String, df_currentPreviousTable_key: String): String =
  if (df_currentFullTable_key == null && df_currentPreviousTable_key != null) "S"
  else if (df_currentFullTable_key != null && df_currentPreviousTable_key == null) "C"
  else null
val UDF_udfCreateFlagCreatedDeleted = udf(udfCreateFlagCreatedDeleted _)
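But I have a problem with the modified rows. How can I retrieve them?
The tables contain string, int, and date columns.
Thanks for your help.
If I do it as shown below, the code becomes very long.
I have 50 columns, and they are not all of the same type.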
val df_currentFullTableExceptPreviousFullCurrentView: DataFrame = df_currentFullTable
.join(df_previousFullCurrentView, df_currentFullTable(key) === df_previousFullCurrentView(key), "full_outer")
.withColumn("flagCreatedDeleted", UDF_udfCreateFlagCreatedDeleted(df_currentFullTable(key),
df_previousFullCurrentView(key)))
.withColumn("flagModifiedStringNameId", UDF_udfCreateFlagModifiedString(df_currentFullTable(key),
df_previousFullCurrentView(key), df_currentFullTable("name_id"), df_previousFullCurrentView("name_id")))
.withColumn("flagModifiedStringSurname", UDF_udfCreateFlagModifiedString(df_currentFullTable(key),
df_previousFullCurrentView(key), df_currentFullTable("Surname"), df_previousFullCurrentView("Surname")))
.withColumn("flagModifiedStringAge", UDF_udfCreateFlagModifiedString(df_currentFullTable(key),
df_previousFullCurrentView(key), df_currentFullTable("Age"), df_previousFullCurrentView("Age")))
.withColumn("flagModifiedStringWorkingE", UDF_udfCreateFlagModifiedString(df_currentFullTable(key),
df_previousFullCurrentView(key), df_currentFullTable("WorkingE"), df_previousFullCurrentView("WorkingE")))
// "U" = same key on both sides but the compared value differs, null otherwise.
def udfCreateFlagModifiedString(df_currentFullTable_key: String, df_currentPreviousTable_key: String,
    CurrentStringModified: String, PreviousStringModified: String): String =
  if (df_currentFullTable_key == df_currentPreviousTable_key &&
      CurrentStringModified != PreviousStringModified) "U"
  else null
val UDF_udfCreateFlagModifiedString = udf(udfCreateFlagModifiedString _)
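With 50 columns, the per-column flags could be generated from a list of column names instead of being written out by hand. The following is only a sketch under assumptions: `ColumnFlagsSketch`, `withModifiedFlags`, the `cur`/`prev` aliases, and the `flagModified_<column>` naming are hypothetical, and column names are assumed to contain no dots or other characters that would need backtick quoting. It uses Spark's built-in null-safe equality operator `<=>` rather than a UDF, so it does not depend on the column type.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

object ColumnFlagsSketch {
  // Hypothetical helper: full outer join on `key`, plus one
  // "flagModified_<column>" column for every name in `columns`.
  def withModifiedFlags(current: DataFrame, previous: DataFrame,
                        key: String, columns: Seq[String]): DataFrame = {
    val joined = current.as("cur")
      .join(previous.as("prev"), col(s"cur.$key") === col(s"prev.$key"), "full_outer")

    // "U" when the key matched on both sides and the two values differ;
    // <=> is null-safe, so null vs. null counts as equal. Otherwise null.
    val flags = columns.map { c =>
      when(col(s"cur.$key") === col(s"prev.$key") && !(col(s"cur.$c") <=> col(s"prev.$c")), "U")
        .as(s"flagModified_$c")
    }

    // Keep every joined column and append all flags in a single select.
    joined.select(col("*") +: flags: _*)
  }
}
```

A call such as `ColumnFlagsSketch.withModifiedFlags(df_currentFullTable, df_previousFullCurrentView, key, df_currentFullTable.columns.filter(_ != key))` would then cover every non-key column without one `withColumn` per column.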
You can also do it like this:
// Returns "updated" when the two values differ, null otherwise.
def isUpdatedColumn(currentColumn: String, previousColumn: String): String =
  if (previousColumn != currentColumn) "updated"
  else null
val isUpdatedColumnUDF = udf(isUpdatedColumn _)
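You don't even need a UDF for this: if previous.id is null, the row was created; if current.id is null, the row was deleted. If neither is null, the row exists in both DataFrames, so you can compare the two rows for equality. If they differ, the row was updated.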
import spark.implicits._
// Field names taken from the output header below.
case class Data(id: Int, x: String, y: String)
val prev = Seq(Data(1, "foo", "bar"), Data(2, "foo2", "bar2"), Data(3, "foo3", "bar3")).toDF
val curr = Seq(Data(1, "foo", "barNew"), Data(3, "foo3", "bar3"), Data(4, "foo4", "bar4")).toDF
prev.createOrReplaceTempView("previous_full")
curr.createOrReplaceTempView("current_full")
spark.sql("""
select *,
  (case when previous_full.id is null then 'C'
        when current_full.id is null then 'S'
        when struct(previous_full.*) <> struct(current_full.*) then 'U'
        else null end) as flag
from previous_full
full outer join current_full on previous_full.id = current_full.id
""").show()
/*
+----+----+----+----+----+------+----+
|  id|   x|   y|  id|   x|     y|flag|
+----+----+----+----+----+------+----+
|   1| foo| bar|   1| foo|barNew|   U|
|   3|foo3|bar3|   3|foo3|  bar3|null|
|null|null|null|   4|foo4|  bar4|   C|
|   2|foo2|bar2|null|null|  null|   S|
+----+----+----+----+----+------+----+
*/
I have updated the question with more details, thanks. I have to do this with DataFrame functions rather than SQL functions.
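The created/deleted/updated logic can also be expressed with DataFrame functions only, with no SQL string and no UDF. What follows is a minimal sketch, not a definitive implementation: `DeltaSketch`, `delta`, and the `cur`/`prev` aliases are hypothetical names, both DataFrames are assumed to share the same column names, and those names are assumed to need no backtick quoting. It uses the flags from the thread ("C", "S", "U", null) and the null-safe equality operator `<=>`, so a value that is null on both sides does not count as a change.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit, when}

object DeltaSketch {
  // Hypothetical helper: full outer join on `key`, plus a "flag" column:
  // "C" = created, "S" = deleted, "U" = updated, null = unchanged.
  def delta(current: DataFrame, previous: DataFrame, key: String): DataFrame = {
    val cur = current.as("cur")
    val prev = previous.as("prev")

    // True only when every non-key column matches on both sides (null-safe).
    val allEqual: Column = current.columns
      .filter(_ != key)
      .map(c => col(s"cur.$c") <=> col(s"prev.$c"))
      .foldLeft(lit(true))(_ && _)

    cur.join(prev, col(s"cur.$key") === col(s"prev.$key"), "full_outer")
      .withColumn("flag",
        when(col(s"prev.$key").isNull, "C")   // key missing in previous: created
          .when(col(s"cur.$key").isNull, "S") // key missing in current: deleted
          .when(!allEqual, "U"))              // present in both but different: updated
      // no otherwise(): unchanged rows get a null flag
  }
}
```

For example, `DeltaSketch.delta(df_currentFullTable, df_previousFullCurrentView, key).filter(col("flag") === "U")` would keep only the modified rows.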