Apache Spark: how do I use a window specification and a join condition per column values?

apache-spark, apache-spark-sql

This is my DF1

OrganizationId|^|AnnualPeriodId|^|InterimPeriodId|^|InterimNumber|^|FFAction
4295858898|^|204|^|205|^|1|^|I|!|
4295858898|^|204|^|208|^|2|^|I|!|
4295858898|^|204|^|209|^|2|^|I|!|
4295858898|^|204|^|211|^|3|^|I|!|
4295858898|^|204|^|212|^|3|^|I|!|
4295858898|^|204|^|214|^|4|^|I|!|
4295858898|^|204|^|215|^|4|^|I|!|
4295858898|^|206|^|207|^|1|^|I|!|
4295858898|^|206|^|210|^|2|^|I|!|
4295858898|^|206|^|213|^|3|^|I|!|
This is my DF2

DataPartition|^|PartitionYear|^|TimeStamp|^|OrganizationId|^|AnnualPeriodId|^|InterimPeriodId|^|InterimNumber|^|FFAction|!|
SelfSourcedPublic|^|2002|^|1511224917595|^|4295858941|^|24|^|25|^|4|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917596|^|4295858941|^|24|^|25|^|4|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917597|^|4295858941|^|30|^|31|^|2|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917598|^|4295858941|^|30|^|31|^|2|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917599|^|4295858941|^|30|^|32|^|1|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917600|^|4295858941|^|30|^|32|^|1|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917601|^|4295858941|^|24|^|33|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917602|^|4295858941|^|24|^|33|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917603|^|4295858941|^|24|^|34|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917604|^|4295858941|^|24|^|34|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917605|^|4295858941|^|1|^|2|^|4|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917606|^|4295858941|^|1|^|3|^|4|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917607|^|4295858941|^|5|^|6|^|4|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917608|^|4295858941|^|5|^|7|^|4|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917609|^|4295858941|^|12|^|10|^|2|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917610|^|4295858941|^|12|^|11|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917611|^|4295858941|^|1|^|13|^|1|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917612|^|4295858941|^|12|^|14|^|1|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917613|^|4295858941|^|5|^|15|^|3|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917614|^|4295858941|^|5|^|16|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917615|^|4295858941|^|1|^|17|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917616|^|4295858941|^|1|^|18|^|3|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917617|^|4295858941|^|5|^|19|^|1|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917618|^|4295858941|^|5|^|20|^|2|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917619|^|4295858941|^|5|^|21|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917620|^|4295858941|^|1|^|22|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917621|^|4295858941|^|1|^|23|^|2|^|O|!|
SelfSourcedPublic|^|2016|^|1511224917622|^|4295858941|^|35|^|36|^|1|^|I|!|
SelfSourcedPublic|^|2016|^|1511224917642|^|4295858941|^|null|^|35|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1511224917643|^|4295858941|^|null|^|36|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1511224917644|^|4295858941|^|null|^|37|^|null|^|D|!|
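Both files use |^| as the field delimiter and |!| as a record terminator, so each line has to be split on the escaped regex \\|\\^\\|. A minimal, standalone example of that split (the full code further down does the same thing when it builds the DataFrames):

// splitting one |^|-delimited record into its fields
val line = "4295858898|^|204|^|205|^|1|^|I|!|"
val fields = line.split("\\|\\^\\|")   // Array(4295858898, 204, 205, 1, I|!|)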
I want to implement the join based on the value of a column.

This is what I am trying to achieve in Spark Scala, but I don't know how to implement it.

If FFAction_1 = I in DF2, then apply the condition below:

(join and partitionBy on the three columns "OrganizationId", "AnnualPeriodId", "InterimPeriodId")

val windowSpec = Window.partitionBy("OrganizationId", "AnnualPeriodId", "InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2result.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")

val dfMainOutput = df1resultFinalWithYear.join(latestForEachKey, Seq("OrganizationId", "AnnualPeriodId", "InterimPeriodId"), "outer")
  .select($"OrganizationId", $"AnnualPeriodId", $"InterimPeriodId",
    when($"FFAction_1".isNotNull, concat(col("FFAction_1"), lit("|!|")))
      .otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"))
  .filter(!$"FFAction".contains("D"))
If FFAction_1 = O or D, then apply the condition below:

(join and partitionBy on the two columns "OrganizationId", "InterimPeriodId")
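The question shows no code for this second case. A minimal sketch of the analogous two-column version (the *OD names are hypothetical; only the join keys plus FFAction are selected to avoid ambiguous column references after the join, and the implicits plus df2result / df1resultFinalWithYear from the question are assumed to be in scope):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, concat, lit, rank, when}
import org.apache.spark.sql.types.LongType

// partition and join on two columns instead of three
val windowSpecForOD = Window.partitionBy("OrganizationId", "InterimPeriodId")
  .orderBy($"TimeStamp".cast(LongType).desc)

val latestForEachKeyOD = df2result
  .withColumn("rank", rank().over(windowSpecForOD))
  .filter($"rank" === 1)
  .drop("rank", "TimeStamp")

val dfMainOutputOD = df1resultFinalWithYear
  .join(latestForEachKeyOD, Seq("OrganizationId", "InterimPeriodId"), "outer")
  .select($"OrganizationId", $"InterimPeriodId",
    when($"FFAction_1".isNotNull, concat(col("FFAction_1"), lit("|!|")))
      .otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"))
  .filter(!$"FFAction".contains("D"))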

Below is my whole code

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.{SparkConf, SparkContext}
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.functions.regexp_extract

val get_cus_val = spark.udf.register("get_cus_val", (filePath: String) => filePath.split("/")(3))
val get_cus_YearPartition = spark.udf.register("get_cus_YearPartition", (filePath: String) => filePath.split("/")(4))

val rdd = sc.textFile("s3://trfsmallffile/Interim2Annual/MAIN")
val header = rdd.filter(_.contains("OrganizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema = StructType(header.map(cols => StructField(cols.replace(".", ""), StringType)).toSeq)
val data = sqlContext.createDataFrame(rdd.filter(!_.contains("OrganizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema)

val schemaHeader = StructType(header.map(cols => StructField(cols.replace(".", ""), StringType)).toSeq)
val dataHeader = sqlContext.createDataFrame(rdd.filter(!_.contains("OrganizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schemaHeader)

val df1resultFinal = data.withColumn("DataPartition", get_cus_val(input_file_name))
val df1resultFinalWithYear = df1resultFinal.withColumn("PartitionYear", get_cus_YearPartition(input_file_name))

//Loading Incremental
val rdd1 = sc.textFile("s3://trfsmallffile/Interim2Annual/INCR")
val header1 = rdd1.filter(_.contains("OrganizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema1 = StructType(header1.map(cols => StructField(cols.replace(".", ""), StringType)).toSeq)
val windowSpec = Window.partitionBy("OrganizationId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc) 

val latestForEachKey = df2result.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")

val dfMainOutput = df1resultFinalWithYear.join(latestForEachKey, Seq("OrganizationId","AnnualPeriodId","InterimPeriodId"), "outer")
  .select($"OrganizationId", $"AnnualPeriodId", $"InterimPeriodId",
    when($"FFAction_1".isNotNull, concat(col("FFAction_1"), lit("|!|")))
      .otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"))
  .filter(!$"FFAction".contains("D"))
val df1 = spark.
  read.
  option("header", true).
  option("sep", "|").
  csv("df1.csv").
  select("OrganizationId", "AnnualPeriodId", "InterimPeriodId", "InterimNumber", "FFAction")
scala> df1.show
+--------------+--------------+---------------+-------------+--------+
|OrganizationId|AnnualPeriodId|InterimPeriodId|InterimNumber|FFAction|
+--------------+--------------+---------------+-------------+--------+
|    4295858898|           204|            205|            1|       I|
|    4295858898|           204|            208|            2|       I|
|    4295858898|           204|            209|            2|       I|
|    4295858898|           204|            211|            3|       I|
|    4295858898|           204|            212|            3|       I|
|    4295858898|           204|            214|            4|       I|
|    4295858898|           204|            215|            4|       I|
|    4295858898|           206|            207|            1|       I|
|    4295858898|           206|            210|            2|       I|
|    4295858898|           206|            213|            3|       I|
+--------------+--------------+---------------+-------------+--------+
val df2 = spark.
  read.
  option("header", true).
  option("sep", "|").
  csv("df2.csv").
  select("DataPartition_1", "PartitionYear_1", "TimeStamp", "OrganizationId", "AnnualPeriodId", "InterimPeriodId", "InterimNumber_1", "FFAction_1")
scala> df2.show
+-----------------+---------------+-------------+--------------+--------------+---------------+---------------+----------+
|  DataPartition_1|PartitionYear_1|    TimeStamp|OrganizationId|AnnualPeriodId|InterimPeriodId|InterimNumber_1|FFAction_1|
+-----------------+---------------+-------------+--------------+--------------+---------------+---------------+----------+
|SelfSourcedPublic|           2002|1510725106270|    4295858941|            24|             25|              4|         O|
|SelfSourcedPublic|           2002|1510725106271|    4295858941|            24|             25|              5|         O|
|SelfSourcedPublic|           2003|1510725106272|    4295858941|            30|             31|              2|         O|
|SelfSourcedPublic|           2003|1510725106273|    4295858941|            30|             31|              3|         O|
|SelfSourcedPublic|           2001|1510725106293|    4295858941|             5|             20|              2|         O|
|SelfSourcedPublic|           2001|1510725106294|    4295858941|             5|             21|              3|         O|
|SelfSourcedPublic|           2002|1510725106295|    4295858941|             1|             22|              4|         O|
|SelfSourcedPublic|           2002|1510725106296|    4295858941|             1|             23|              5|         O|
|SelfSourcedPublic|           2016|1510725106297|    4295858941|            35|             36|              1|         I|
|SelfSourcedPublic|           2016|1510725106297|    4295858941|            35|             36|              1|         D|
+-----------------+---------------+-------------+--------------+--------------+---------------+---------------+----------+
val noIs = df2.filter($"FFAction_1" === "I").take(1).isEmpty
val (windowSpec, joinCond) = if (noIs) {
  (windowSpecForOs, joinForOs) 
} else {
  (windowSpecForIs, joinForIs)
}
val latestForEachKey = df2result.withColumn("rank", rank() over windowSpec)
val dfMainOutput = df1resultFinalWithYear.join(latestForEachKey).where(joinCond)
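The answer never shows how windowSpecForIs, windowSpecForOs, joinForIs and joinForOs are defined. A sketch of one possible set of definitions, under the assumption that each case is an equi-join on the same columns used in partitionBy and that the join keys are held as column-name lists (the implicits and the frames from the question are assumed to be in scope):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types.LongType

// three partition/join keys while DF2 still contains I actions
val windowSpecForIs = Window
  .partitionBy("OrganizationId", "AnnualPeriodId", "InterimPeriodId")
  .orderBy($"TimeStamp".cast(LongType).desc)
val joinForIs = Seq("OrganizationId", "AnnualPeriodId", "InterimPeriodId")

// two keys when DF2 only contains O or D actions
val windowSpecForOs = Window
  .partitionBy("OrganizationId", "InterimPeriodId")
  .orderBy($"TimeStamp".cast(LongType).desc)
val joinForOs = Seq("OrganizationId", "InterimPeriodId")

With the conditions held as column-name lists, the final join would be written as df1resultFinalWithYear.join(latestForEachKey, joinCond, "outer") rather than .join(latestForEachKey).where(joinCond); that is a deliberate swap from the answer's Column-based condition, since a name-based equi-join can be built before latestForEachKey exists.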