How do I use a window spec and join condition based on column values in Apache Spark?

Tags: apache-spark, apache-spark-sql

Here is my DF1:
OrganizationId|^|AnnualPeriodId|^|InterimPeriodId|^|InterimNumber|^|FFAction
4295858898|^|204|^|205|^|1|^|I|!|
4295858898|^|204|^|208|^|2|^|I|!|
4295858898|^|204|^|209|^|2|^|I|!|
4295858898|^|204|^|211|^|3|^|I|!|
4295858898|^|204|^|212|^|3|^|I|!|
4295858898|^|204|^|214|^|4|^|I|!|
4295858898|^|204|^|215|^|4|^|I|!|
4295858898|^|206|^|207|^|1|^|I|!|
4295858898|^|206|^|210|^|2|^|I|!|
4295858898|^|206|^|213|^|3|^|I|!|
Here is my DF2:
DataPartition|^|PartitionYear|^|TimeStamp|^|OrganizationId|^|AnnualPeriodId|^|InterimPeriodId|^|InterimNumber|^|FFAction|!|
SelfSourcedPublic|^|2002|^|1511224917595|^|4295858941|^|24|^|25|^|4|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917596|^|4295858941|^|24|^|25|^|4|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917597|^|4295858941|^|30|^|31|^|2|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917598|^|4295858941|^|30|^|31|^|2|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917599|^|4295858941|^|30|^|32|^|1|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917600|^|4295858941|^|30|^|32|^|1|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917601|^|4295858941|^|24|^|33|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917602|^|4295858941|^|24|^|33|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917603|^|4295858941|^|24|^|34|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917604|^|4295858941|^|24|^|34|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917605|^|4295858941|^|1|^|2|^|4|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917606|^|4295858941|^|1|^|3|^|4|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917607|^|4295858941|^|5|^|6|^|4|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917608|^|4295858941|^|5|^|7|^|4|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917609|^|4295858941|^|12|^|10|^|2|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917610|^|4295858941|^|12|^|11|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917611|^|4295858941|^|1|^|13|^|1|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917612|^|4295858941|^|12|^|14|^|1|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917613|^|4295858941|^|5|^|15|^|3|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917614|^|4295858941|^|5|^|16|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917615|^|4295858941|^|1|^|17|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917616|^|4295858941|^|1|^|18|^|3|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917617|^|4295858941|^|5|^|19|^|1|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917618|^|4295858941|^|5|^|20|^|2|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917619|^|4295858941|^|5|^|21|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917620|^|4295858941|^|1|^|22|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917621|^|4295858941|^|1|^|23|^|2|^|O|!|
SelfSourcedPublic|^|2016|^|1511224917622|^|4295858941|^|35|^|36|^|1|^|I|!|
SelfSourcedPublic|^|2016|^|1511224917642|^|4295858941|^|null|^|35|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1511224917643|^|4295858941|^|null|^|36|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1511224917644|^|4295858941|^|null|^|37|^|null|^|D|!|
I want to implement the join based on the value of a column. This is what I am trying to achieve in Spark Scala, but I don't know how to implement it.

If FFAction_1 in DF2 = I, then apply the condition below (join and partitionBy on the three columns "OrganizationId", "AnnualPeriodId", "InterimPeriodId"):
val windowSpec = Window.partitionBy("OrganizationId", "AnnualPeriodId", "InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2result.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")

val dfMainOutput = df1resultFinalWithYear.join(latestForEachKey, Seq("OrganizationId", "AnnualPeriodId", "InterimPeriodId"), "outer")
  .select($"OrganizationId", $"AnnualPeriodId", $"InterimPeriodId",
    when($"FFAction_1".isNotNull, concat(col("FFAction_1"), lit("|!|")))
      .otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"))
  .filter(!$"FFAction".contains("D"))
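The when/otherwise above encodes a simple precedence rule: take FFAction_1 from the incremental file when it is present, otherwise keep the original FFAction, then drop deleted (D) rows. Stripped of Spark, the same rule can be sketched in plain Scala (the object and method names here are my own, not from the original code):

```scala
object MergeFFAction {
  // Prefer the incremental action when present, append the |!| row
  // terminator, and drop any row whose merged action is a delete (D).
  def merge(ffAction: String, ffAction1: Option[String]): Option[String] = {
    val merged = ffAction1.getOrElse(ffAction) + "|!|"
    if (merged.contains("D")) None else Some(merged)
  }
}
```

This mirrors `when($"FFAction_1".isNotNull, ...).otherwise(...)` followed by `.filter(!$"FFAction".contains("D"))`.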
If FFAction_1 = O or D, then apply the condition below (join and partitionBy on the two columns "OrganizationId", "InterimPeriodId").
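The two cases above differ only in which key columns are used. That choice can be expressed as a small plain-Scala helper (the helper name is hypothetical, introduced only for illustration):

```scala
object JoinKeyChooser {
  // Returns the join/partitionBy columns for a given FFAction_1 value:
  // three key columns for inserts (I), two for overwrites/deletes (O, D).
  def keyColumns(ffAction1: String): Seq[String] = ffAction1 match {
    case "I"       => Seq("OrganizationId", "AnnualPeriodId", "InterimPeriodId")
    case "O" | "D" => Seq("OrganizationId", "InterimPeriodId")
    case other     => sys.error(s"unexpected FFAction: $other")
  }
}
```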
Below is my complete code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.{SparkConf, SparkContext}
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.functions.regexp_extract

val get_cus_val = spark.udf.register("get_cus_val", (filePath: String) => filePath.split("/")(3))
val get_cus_YearPartition = spark.udf.register("get_cus_YearPartition", (filePath: String) => filePath.split("/")(4))

val rdd = sc.textFile("s3://trfsmallffile/Interim2Annual/MAIN")
val header = rdd.filter(_.contains("OrganizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema = StructType(header.map(cols => StructField(cols.replace(".", ""), StringType)).toSeq)
val data = sqlContext.createDataFrame(rdd.filter(!_.contains("OrganizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema)

val schemaHeader = StructType(header.map(cols => StructField(cols.replace(".", ""), StringType)).toSeq)
val dataHeader = sqlContext.createDataFrame(rdd.filter(!_.contains("OrganizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schemaHeader)

val df1resultFinal = data.withColumn("DataPartition", get_cus_val(input_file_name))
val df1resultFinalWithYear = df1resultFinal.withColumn("PartitionYear", get_cus_YearPartition(input_file_name))

//Loading the incremental file
val rdd1 = sc.textFile("s3://trfsmallffile/Interim2Annual/INCR")
val header1 = rdd1.filter(_.contains("OrganizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema1 = StructType(header1.map(cols => StructField(cols.replace(".", ""), StringType)).toSeq)
val windowSpec = Window.partitionBy("OrganizationId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2result.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")
val dfMainOutput = df1resultFinalWithYear.join(latestForEachKey, Seq("OrganizationId","AnnualPeriodId","InterimPeriodId"), "outer")
.select($"OrganizationId", $"AnnualPeriodId",$"InterimPeriodId",
when($"FFAction_1".isNotNull, concat(col("FFAction_1"),
lit("|!|"))).otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"))
.filter(!$"FFAction".contains("D"))
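The windowSpec/rank step above keeps, for each partition key, only the row with the newest TimeStamp. The same "latest per key" idea, stripped of Spark and written over plain collections (the sample tuples and helper name are mine, for illustration only):

```scala
object LatestPerKey {
  // One row: (OrganizationId, InterimPeriodId, TimeStamp, FFAction)
  type Row = (String, String, Long, String)

  // For each (OrganizationId, InterimPeriodId) key, keep only the row with
  // the largest TimeStamp -- the collections analogue of
  // rank().over(windowSpec) === 1 with TimeStamp ordered descending.
  def latestPerKey(rows: Seq[Row]): Seq[Row] =
    rows.groupBy(r => (r._1, r._2)).valuesIterator.map(_.maxBy(_._3)).toSeq
}
```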
val df1 = spark.
read.
option("header", true).
option("sep", "|").
csv("df1.csv").
select("OrganizationId", "AnnualPeriodId", "InterimPeriodId", "InterimNumber", "FFAction")
scala> df1.show
+--------------+--------------+---------------+-------------+--------+
|OrganizationId|AnnualPeriodId|InterimPeriodId|InterimNumber|FFAction|
+--------------+--------------+---------------+-------------+--------+
| 4295858898| 204| 205| 1| I|
| 4295858898| 204| 208| 2| I|
| 4295858898| 204| 209| 2| I|
| 4295858898| 204| 211| 3| I|
| 4295858898| 204| 212| 3| I|
| 4295858898| 204| 214| 4| I|
| 4295858898| 204| 215| 4| I|
| 4295858898| 206| 207| 1| I|
| 4295858898| 206| 210| 2| I|
| 4295858898| 206| 213| 3| I|
+--------------+--------------+---------------+-------------+--------+
val df2 = spark.
read.
option("header", true).
option("sep", "|").
csv("df2.csv").
select("DataPartition_1", "PartitionYear_1", "TimeStamp", "OrganizationId", "AnnualPeriodId", "InterimPeriodId", "InterimNumber_1", "FFAction_1")
scala> df2.show
+-----------------+---------------+-------------+--------------+--------------+---------------+---------------+----------+
| DataPartition_1|PartitionYear_1| TimeStamp|OrganizationId|AnnualPeriodId|InterimPeriodId|InterimNumber_1|FFAction_1|
+-----------------+---------------+-------------+--------------+--------------+---------------+---------------+----------+
|SelfSourcedPublic| 2002|1510725106270| 4295858941| 24| 25| 4| O|
|SelfSourcedPublic| 2002|1510725106271| 4295858941| 24| 25| 5| O|
|SelfSourcedPublic| 2003|1510725106272| 4295858941| 30| 31| 2| O|
|SelfSourcedPublic| 2003|1510725106273| 4295858941| 30| 31| 3| O|
|SelfSourcedPublic| 2001|1510725106293| 4295858941| 5| 20| 2| O|
|SelfSourcedPublic| 2001|1510725106294| 4295858941| 5| 21| 3| O|
|SelfSourcedPublic| 2002|1510725106295| 4295858941| 1| 22| 4| O|
|SelfSourcedPublic| 2002|1510725106296| 4295858941| 1| 23| 5| O|
|SelfSourcedPublic| 2016|1510725106297| 4295858941| 35| 36| 1| I|
|SelfSourcedPublic| 2016|1510725106297| 4295858941| 35| 36| 1| D|
+-----------------+---------------+-------------+--------------+--------------+---------------+---------------+----------+
// If df2 has no "I" rows at all, use the two-column (O/D) window spec and
// join condition; otherwise use the three-column (I) variant.
val noIs = df2.filter($"FFAction_1" === "I").take(1).isEmpty
val (windowSpec, joinCond) = if (noIs) {
  (windowSpecForOs, joinForOs)
} else {
  (windowSpecForIs, joinForIs)
}
val latestForEachKey = df2.withColumn("rank", rank() over windowSpec).filter($"rank" === 1).drop("rank", "TimeStamp")
val dfMainOutput = df1.join(latestForEachKey).where(joinCond)
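The snippet above leaves windowSpecForIs/windowSpecForOs and joinForIs/joinForOs undefined. One way they might be filled in, following the two cases from the question (this is my sketch, not part of the original answer; it picks the key columns once and derives both the window spec and the join from them, and uses a column-name join instead of an explicit join Column):

```scala
// Sketch only -- assumes the Window/functions/LongType imports used earlier.
val keyCols =
  if (df2.filter($"FFAction_1" === "I").take(1).isEmpty)
    Seq("OrganizationId", "InterimPeriodId")                    // only O/D rows
  else
    Seq("OrganizationId", "AnnualPeriodId", "InterimPeriodId")  // I rows present

val windowSpec = Window.partitionBy(keyCols.map(col): _*)
  .orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2.withColumn("rank", rank().over(windowSpec))
  .filter($"rank" === 1).drop("rank", "TimeStamp")
val dfMainOutput = df1.join(latestForEachKey, keyCols, "outer")
```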