Spark Scala: Filtering a Spark Dataset based on multiple columns and conditions
I'm having a hard time finding a good way to filter a Spark Dataset. I've described the basic problem below:

1. For every key, check whether a status code === UV exists.
2. If there is no UV status code associated with the key, ignore the key entirely.
   - Note: there can only be one UV per key.
3. If there is one, search for the nearest OA event after the UV timestamp.
   - Note: there may be multiple OA events after the UV timestamp; I want the one closest to the UV timestamp.
4. If the only OA event occurred in the past (i.e. before the UV), I would still like to keep the record, since the expected OA is still coming, but I'd like to capture the row with the OA status code and replace the value with null.

Input
+-----------+----------+-------------------+
|key |statusCode|statusTimestamp |
+-----------+----------+-------------------+
|AAAAAABBBBB|OA |2019-05-24 14:46:00|
|AAAAAABBBBB|VD |2019-05-31 19:31:00|
|AAAAAABBBBB|VA |2019-06-26 00:00:00|
|AAAAAABBBBB|E |2019-06-26 02:00:00|
|AAAAAABBBBB|UV |2019-06-29 00:00:00|
|AAAAAABBBBB|OA |2019-07-01 00:00:00|
|AAAAAABBBBB|EE |2019-07-03 01:00:00|
+-----------+----------+-------------------+
Expected Output
+-----------+----------+-------------------+
|key |statusCode|statusTimestamp |
+-----------+----------+-------------------+
|AAAAAABBBBB|UV |2019-06-29 00:00:00|
|AAAAAABBBBB|OA |2019-07-01 00:00:00|
+-----------+----------+-------------------+
I know I could get at this by setting the data up like the following, but does anyone have any suggestions on how to go about the above filter?
someDS
  .groupBy("key")
  .pivot("statusCode", Seq("UV", "OA"))
  .agg(collect_set($"statusTimestamp"))
  .thenSomethingElse...
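For reference, here's a minimal runnable sketch of that setup, assuming a few rows from the Input table above; the someDS rows and the pivoted name are illustrative, and the final filtering step is left open since that is exactly the part in question:

import java.sql.Timestamp
import org.apache.spark.sql.functions._
import spark.implicits._

// Illustrative stand-in for someDS, built from a few rows of the Input table.
val someDS = Seq(
  ("AAAAAABBBBB", "OA", Timestamp.valueOf("2019-05-24 14:46:00")),
  ("AAAAAABBBBB", "UV", Timestamp.valueOf("2019-06-29 00:00:00")),
  ("AAAAAABBBBB", "OA", Timestamp.valueOf("2019-07-01 00:00:00"))
).toDF("key", "statusCode", "statusTimestamp")

// The pivot collapses each key to a single row with two array columns,
// `UV` and `OA`, holding the collected statusTimestamp values; picking the
// OA closest after the UV would still require per-array logic (e.g. a UDF).
val pivoted = someDS
  .groupBy("key")
  .pivot("statusCode", Seq("UV", "OA"))
  .agg(collect_set($"statusTimestamp"))
pivoted.show(false)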
While the groupBy/pivot approach would group the timestamps nicely, it would require non-trivial steps (most likely a UDF) to perform the necessary filtering, followed by re-expansion. Here's a different approach that consists of the following steps:

1. Filter the dataset down to rows with statusCode "UV" or "OA"
2. For each row, use Window functions lag/lead to assemble a string of the statusCodes from the previous, current, and next two rows
3. Use Regex pattern matching against that string to identify the wanted rows

import java.sql.Timestamp
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
// Sample data:
// key `A`: requirement #3
// key `B`: requirement #2
// key `C`: requirement #4
val df = Seq(
  ("A", "OA", Timestamp.valueOf("2019-05-20 00:00:00")),
  ("A", "E", Timestamp.valueOf("2019-05-30 00:00:00")),
  ("A", "UV", Timestamp.valueOf("2019-06-22 00:00:00")),
  ("A", "OA", Timestamp.valueOf("2019-07-01 00:00:00")),
  ("A", "OA", Timestamp.valueOf("2019-07-03 00:00:00")),
  ("B", "C", Timestamp.valueOf("2019-06-15 00:00:00")),
  ("B", "OA", Timestamp.valueOf("2019-06-25 00:00:00")),
  ("C", "D", Timestamp.valueOf("2019-06-01 00:00:00")),
  ("C", "OA", Timestamp.valueOf("2019-06-30 00:00:00")),
  ("C", "UV", Timestamp.valueOf("2019-07-02 00:00:00"))
).toDF("key", "statusCode", "statusTimestamp")
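// Step 1: keep only the UV/OA rows; step 2: within each key (ordered by
// statusTimestamp), concatenate the statusCodes of the previous, current,
// and next two rows into a single "#"-delimited string for later matching.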
val win = Window.partitionBy("key").orderBy("statusTimestamp")
val df2 = df.
  where($"statusCode" === "UV" || $"statusCode" === "OA").
  withColumn("statusPrevCurrNext2", concat(
    coalesce(lag($"statusCode", 1).over(win), lit("")),
    lit("#"),
    $"statusCode",
    lit("#"),
    coalesce(lead($"statusCode", 1).over(win), lit("")),
    lit("#"),
    coalesce(lead($"statusCode", 2).over(win), lit(""))
  ))
Let's take a look at df2 (the result of steps 1 and 2):