Scala: How to create a sequence of events (column values) from other columns?


I have a Spark dataframe like the one below:

val myDF = Seq(
(1,"A",100,0,0),
(1,"E",200,0,0),
(1,"",300,1,49),
(2,"A",200,0,0),
(2,"C",300,0,0),
(2,"D",100,0,0)
).toDF("visitor","channel","timestamp","purchase_flag","amount")


scala> myDF.show
+-------+-------+---------+-------------+------+
|visitor|channel|timestamp|purchase_flag|amount|
+-------+-------+---------+-------------+------+
|      1|      A|      100|            0|     0|
|      1|      E|      200|            0|     0|
|      1|       |      300|            1|    49|
|      2|      A|      200|            0|     0|
|      2|      C|      300|            0|     0|
|      2|      D|      100|            0|     0|
+-------+-------+---------+-------------+------+
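
For each visitor in myDF, I want to build a sequence dataframe that traces the visitor's path through the channels, ordered by the timestamp column. The output dataframe should look like the following (-> can be any delimiter):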

+-------+--------------------+
|visitor|channel sequence    |
+-------+--------------------+
|1      |A->E->purchase      |
|2      |D->A->C->no_purchase|
+-------+--------------------+

To be clear, visitor 2 touched channel D, then A, then C, and did not buy anything. The sequence should therefore be D->A->C->no_purchase.

Note: whenever a purchase happens, the channel value is blank and purchase_flag is set to 1.


I want to do this with a Scala UDF in Spark so that the same approach can be reapplied to other datasets.

Here is how you can do it with a UDF:

val myDF = Seq(
  (1,"A",100,0,0),
  (1,"E",200,0,0),
  (1,"",300,1,49),
  (2,"A",200,0,0),
  (2,"C",300,0,0),
  (2,"D",100,0,0)
).toDF("visitor","channel","timestamp","purchase_flag","amount")

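import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// Sort the (channel, timestamp) pairs by timestamp, drop blank channels,
// join them with "->", then append the purchase outcome.
val sequenceUdf = udf((events: Seq[Row], purchased: Seq[Int]) => {
  val channels = events
    .map(row => (row.getAs[String]("channel"), row.getAs[Int]("timestamp")))
    .sortBy(_._2)
    .map(_._1)
    .filterNot(_ == "")
  channels.mkString("->") + (if (purchased.contains(1)) "->purchase" else "->no_purchase")
})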

myDF.groupBy("visitor").agg(collect_list(struct("channel", "timestamp")).as("struct"), collect_list("purchase_flag").as("purchased"))
  .select(col("visitor"), sequenceUdf(col("struct"), col("purchased")).as("channel sequence"))
  .show(false)

which should give you

+-------+--------------------+
|visitor|channel sequence    |
+-------+--------------------+
|1      |A->E->purchase      |
|2      |D->A->C->no_purchase|
+-------+--------------------+

You can make this as generic as you need. It is only a demonstration of how to proceed.
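
One possible way to make it more generic is to pull the sequencing logic out of the UDF into a plain Scala function. This keeps the delimiter and the outcome labels configurable, and it can be checked without a Spark session. The following is only a sketch under assumptions: the names SequenceDemo, Event and buildSequence and the parameter names are hypothetical and do not come from the code above.

```scala
// Hypothetical helper, not part of the original answer.
object SequenceDemo {

  // One row of the input data, reduced to the fields the sequence needs.
  final case class Event(channel: String, timestamp: Int, purchaseFlag: Int)

  // Orders events by timestamp, drops blank channels, and appends the outcome label.
  def buildSequence(
      events: Seq[Event],
      delimiter: String = "->",
      purchaseLabel: String = "purchase",
      noPurchaseLabel: String = "no_purchase"
  ): String = {
    val channels = events.sortBy(_.timestamp).map(_.channel).filter(_.nonEmpty)
    val outcome =
      if (events.exists(_.purchaseFlag == 1)) purchaseLabel else noPurchaseLabel
    (channels :+ outcome).mkString(delimiter)
  }

  def main(args: Array[String]): Unit = {
    // Same rows as myDF, grouped by visitor by hand.
    val visitor1 = Seq(Event("A", 100, 0), Event("E", 200, 0), Event("", 300, 1))
    val visitor2 = Seq(Event("A", 200, 0), Event("C", 300, 0), Event("D", 100, 0))

    println(buildSequence(visitor1)) // A->E->purchase
    println(buildSequence(visitor2)) // D->A->C->no_purchase
    println(buildSequence(visitor2, delimiter = " > ")) // D > A > C > no_purchase
  }
}
```

Note one difference in behaviour: for a visitor with no non-blank channels, this sketch returns just the outcome label, whereas the UDF above would return it with a leading delimiter ("->purchase"). A function like this could then be called from inside a udf after the rows have been collected per visitor.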
