How do I use a window spec and join condition based on column values in Apache Spark?

Tags: apache-spark, apache-spark-sql

Here is my DF1:
OrganizationId|^|AnnualPeriodId|^|InterimPeriodId|^|InterimNumber|^|FFAction
4295858898|^|204|^|205|^|1|^|I|!|
4295858898|^|204|^|208|^|2|^|I|!|
4295858898|^|204|^|209|^|2|^|I|!|
4295858898|^|204|^|211|^|3|^|I|!|
4295858898|^|204|^|212|^|3|^|I|!|
4295858898|^|204|^|214|^|4|^|I|!|
4295858898|^|204|^|215|^|4|^|I|!|
4295858898|^|206|^|207|^|1|^|I|!|
4295858898|^|206|^|210|^|2|^|I|!|
4295858898|^|206|^|213|^|3|^|I|!|
Here is my DF2:
DataPartition|^|PartitionYear|^|TimeStamp|^|OrganizationId|^|AnnualPeriodId|^|InterimPeriodId|^|InterimNumber|^|FFAction|!|
SelfSourcedPublic|^|2002|^|1511224917595|^|4295858941|^|24|^|25|^|4|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917596|^|4295858941|^|24|^|25|^|4|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917597|^|4295858941|^|30|^|31|^|2|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917598|^|4295858941|^|30|^|31|^|2|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917599|^|4295858941|^|30|^|32|^|1|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917600|^|4295858941|^|30|^|32|^|1|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917601|^|4295858941|^|24|^|33|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917602|^|4295858941|^|24|^|33|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917603|^|4295858941|^|24|^|34|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917604|^|4295858941|^|24|^|34|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917605|^|4295858941|^|1|^|2|^|4|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917606|^|4295858941|^|1|^|3|^|4|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917607|^|4295858941|^|5|^|6|^|4|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917608|^|4295858941|^|5|^|7|^|4|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917609|^|4295858941|^|12|^|10|^|2|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917610|^|4295858941|^|12|^|11|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917611|^|4295858941|^|1|^|13|^|1|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917612|^|4295858941|^|12|^|14|^|1|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917613|^|4295858941|^|5|^|15|^|3|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917614|^|4295858941|^|5|^|16|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917615|^|4295858941|^|1|^|17|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917616|^|4295858941|^|1|^|18|^|3|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917617|^|4295858941|^|5|^|19|^|1|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917618|^|4295858941|^|5|^|20|^|2|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917619|^|4295858941|^|5|^|21|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917620|^|4295858941|^|1|^|22|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917621|^|4295858941|^|1|^|23|^|2|^|O|!|
SelfSourcedPublic|^|2016|^|1511224917622|^|4295858941|^|35|^|36|^|1|^|I|!|
SelfSourcedPublic|^|2016|^|1511224917642|^|4295858941|^|null|^|35|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1511224917643|^|4295858941|^|null|^|36|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1511224917644|^|4295858941|^|null|^|37|^|null|^|D|!|
I want to implement the join based on the value of a column. This is what I am trying to achieve in Spark Scala, but I don't know how to implement it.

If FFAction_1 in DF2 = I, then apply the condition below (join and partitionBy on the three columns "OrganizationId", "AnnualPeriodId", "InterimPeriodId"):
val windowSpec = Window.partitionBy("OrganizationId", "AnnualPeriodId", "InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2result.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")

val dfMainOutput = df1resultFinalWithYear.join(latestForEachKey, Seq("OrganizationId", "AnnualPeriodId", "InterimPeriodId"), "outer")
  .select($"OrganizationId", $"AnnualPeriodId", $"InterimPeriodId",
    when($"FFAction_1".isNotNull, concat(col("FFAction_1"), lit("|!|")))
      .otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"))
  .filter(!$"FFAction".contains("D"))
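The when/otherwise above encodes a simple precedence rule: take FFAction_1 from the incremental file when it is present, otherwise keep the original FFAction, then drop deleted (D) rows. Stripped of Spark, the same rule can be sketched in plain Scala (the object and method names here are my own, not from the original code):

```scala
object MergeFFAction {
  // Prefer the incremental action when present, append the |!| row
  // terminator, and drop any row whose merged action is a delete (D).
  def merge(ffAction: String, ffAction1: Option[String]): Option[String] = {
    val merged = ffAction1.getOrElse(ffAction) + "|!|"
    if (merged.contains("D")) None else Some(merged)
  }
}
```

This mirrors `when($"FFAction_1".isNotNull, ...).otherwise(...)` followed by `.filter(!$"FFAction".contains("D"))`.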
If FFAction_1 = O or D, then apply the condition below (join and partitionBy on the two columns "OrganizationId", "InterimPeriodId").
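The two cases above differ only in which key columns are used. That choice can be expressed as a small plain-Scala helper (the helper name is hypothetical, introduced only for illustration):

```scala
object JoinKeyChooser {
  // Returns the join/partitionBy columns for a given FFAction_1 value:
  // three key columns for inserts (I), two for overwrites/deletes (O, D).
  def keyColumns(ffAction1: String): Seq[String] = ffAction1 match {
    case "I"       => Seq("OrganizationId", "AnnualPeriodId", "InterimPeriodId")
    case "O" | "D" => Seq("OrganizationId", "InterimPeriodId")
    case other     => sys.error(s"unexpected FFAction: $other")
  }
}
```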
Below is my complete code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.{SparkConf, SparkContext}
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.functions.regexp_extract

val get_cus_val = spark.udf.register("get_cus_val", (filePath: String) => filePath.split("/")(3))
val get_cus_YearPartition = spark.udf.register("get_cus_YearPartition", (filePath: String) => filePath.split("/")(4))

val rdd = sc.textFile("s3://trfsmallffile/Interim2Annual/MAIN")
val header = rdd.filter(_.contains("OrganizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema = StructType(header.map(cols => StructField(cols.replace(".", ""), StringType)).toSeq)
val data = sqlContext.createDataFrame(rdd.filter(!_.contains("OrganizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema)

val schemaHeader = StructType(header.map(cols => StructField(cols.replace(".", ""), StringType)).toSeq)
val dataHeader = sqlContext.createDataFrame(rdd.filter(!_.contains("OrganizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schemaHeader)

val df1resultFinal = data.withColumn("DataPartition", get_cus_val(input_file_name))
val df1resultFinalWithYear = df1resultFinal.withColumn("PartitionYear", get_cus_YearPartition(input_file_name))

//Loading the incremental file
val rdd1 = sc.textFile("s3://trfsmallffile/Interim2Annual/INCR")
val header1 = rdd1.filter(_.contains("OrganizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema1 = StructType(header1.map(cols => StructField(cols.replace(".", ""), StringType)).toSeq)
val windowSpec = Window.partitionBy("OrganizationId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2result.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")
val dfMainOutput = df1resultFinalWithYear.join(latestForEachKey, Seq("OrganizationId","AnnualPeriodId","InterimPeriodId"), "outer")
.select($"OrganizationId", $"AnnualPeriodId",$"InterimPeriodId",
when($"FFAction_1".isNotNull, concat(col("FFAction_1"),
lit("|!|"))).otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"))
.filter(!$"FFAction".contains("D"))
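The windowSpec/rank step above keeps, for each partition key, only the row with the newest TimeStamp. The same "latest per key" idea, stripped of Spark and written over plain collections (the sample tuples and helper name are mine, for illustration only):

```scala
object LatestPerKey {
  // One row: (OrganizationId, InterimPeriodId, TimeStamp, FFAction)
  type Row = (String, String, Long, String)

  // For each (OrganizationId, InterimPeriodId) key, keep only the row with
  // the largest TimeStamp -- the collections analogue of
  // rank().over(windowSpec) === 1 with TimeStamp ordered descending.
  def latestPerKey(rows: Seq[Row]): Seq[Row] =
    rows.groupBy(r => (r._1, r._2)).valuesIterator.map(_.maxBy(_._3)).toSeq
}
```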
val df1 = spark.
read.
option("header", true).
option("sep", "|").
csv("df1.csv").
select("OrganizationId", "AnnualPeriodId", "InterimPeriodId", "InterimNumber", "FFAction")
scala> df1.show
+--------------+--------------+---------------+-------------+--------+
|OrganizationId|AnnualPeriodId|InterimPeriodId|InterimNumber|FFAction|
+--------------+--------------+---------------+-------------+--------+
| 4295858898| 204| 205| 1| I|
| 4295858898| 204| 208| 2| I|
| 4295858898| 204| 209| 2| I|
| 4295858898| 204| 211| 3| I|
| 4295858898| 204| 212| 3| I|
| 4295858898| 204| 214| 4| I|
| 4295858898| 204| 215| 4| I|
| 4295858898| 206| 207| 1| I|
| 4295858898| 206| 210| 2| I|
| 4295858898| 206| 213| 3| I|
+--------------+--------------+---------------+-------------+--------+
val df2 = spark.
read.
option("header", true).
option("sep", "|").
csv("df2.csv").
select("DataPartition_1", "PartitionYear_1", "TimeStamp", "OrganizationId", "AnnualPeriodId", "InterimPeriodId", "InterimNumber_1", "FFAction_1")
scala> df2.show
+-----------------+---------------+-------------+--------------+--------------+---------------+---------------+----------+
| DataPartition_1|PartitionYear_1| TimeStamp|OrganizationId|AnnualPeriodId|InterimPeriodId|InterimNumber_1|FFAction_1|
+-----------------+---------------+-------------+--------------+--------------+---------------+---------------+----------+
|SelfSourcedPublic| 2002|1510725106270| 4295858941| 24| 25| 4| O|
|SelfSourcedPublic| 2002|1510725106271| 4295858941| 24| 25| 5| O|
|SelfSourcedPublic| 2003|1510725106272| 4295858941| 30| 31| 2| O|
|SelfSourcedPublic| 2003|1510725106273| 4295858941| 30| 31| 3| O|
|SelfSourcedPublic| 2001|1510725106293| 4295858941| 5| 20| 2| O|
|SelfSourcedPublic| 2001|1510725106294| 4295858941| 5| 21| 3| O|
|SelfSourcedPublic| 2002|1510725106295| 4295858941| 1| 22| 4| O|
|SelfSourcedPublic| 2002|1510725106296| 4295858941| 1| 23| 5| O|
|SelfSourcedPublic| 2016|1510725106297| 4295858941| 35| 36| 1| I|
|SelfSourcedPublic| 2016|1510725106297| 4295858941| 35| 36| 1| D|
+-----------------+---------------+-------------+--------------+--------------+---------------+---------------+----------+
// If df2 has no "I" rows at all, use the two-column (O/D) window spec and
// join condition; otherwise use the three-column (I) variant.
val noIs = df2.filter($"FFAction_1" === "I").take(1).isEmpty
val (windowSpec, joinCond) = if (noIs) {
  (windowSpecForOs, joinForOs)
} else {
  (windowSpecForIs, joinForIs)
}
val latestForEachKey = df2.withColumn("rank", rank() over windowSpec).filter($"rank" === 1).drop("rank", "TimeStamp")
val dfMainOutput = df1.join(latestForEachKey).where(joinCond)
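The snippet above leaves windowSpecForIs/windowSpecForOs and joinForIs/joinForOs undefined. One way they might be filled in, following the two cases from the question (this is my sketch, not part of the original answer; it picks the key columns once and derives both the window spec and the join from them, and uses a column-name join instead of an explicit join Column):

```scala
// Sketch only -- assumes the Window/functions/LongType imports used earlier.
val keyCols =
  if (df2.filter($"FFAction_1" === "I").take(1).isEmpty)
    Seq("OrganizationId", "InterimPeriodId")                    // only O/D rows
  else
    Seq("OrganizationId", "AnnualPeriodId", "InterimPeriodId")  // I rows present

val windowSpec = Window.partitionBy(keyCols.map(col): _*)
  .orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2.withColumn("rank", rank().over(windowSpec))
  .filter($"rank" === 1).drop("rank", "TimeStamp")
val dfMainOutput = df1.join(latestForEachKey, keyCols, "outer")
```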