How do I count rows matching related conditions in Apache Spark?
I have a table like this:
TripID | Name | State
1 | John | OH
2 | John | OH
3 | John | CA
4 | John | OH
1 | Mike | CA
2 | Mike | CA
3 | Mike | OH
I want to count the people who went to OH first and then to CA.
In the example above, the answer should be 1.
So I would like to know: how can I impose a specific ordering in a SQL filter to get this result?

I may have misunderstood your question, but if you are asking "how many people went to OH first and then to CA", a (sketch of a) query could look as follows:
scala> trips.show
+------+----+-----+
|tripid|name|state|
+------+----+-----+
| 1|John| OH|
| 2|John| OH|
| 3|John| CA|
| 4|John| OH|
| 1|Mike| CA|
| 2|Mike| CA|
| 3|Mike| OH|
+------+----+-----+
scala> trips.orderBy("name", "tripid").groupBy("name").agg(collect_list("state")).show
+----+-------------------+
|name|collect_list(state)|
+----+-------------------+
|John| [OH, OH, CA, OH]|
|Mike| [CA, CA, OH]|
+----+-------------------+
In my view, you have two options:

1. a UDF over collect_list (replacing the collected list with only the trips that hold distinct states), or
2. explode (after you have collected the values) and/or window aggregations.

groupBy is not actually required(!): you can handle it with window aggregation alone, applied twice. Note that if you do not specify a column to order by, which row counts as "first" is arbitrary.
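The first option (a UDF over collect_list) boils down to a per-person predicate on the ordered state list. Here is a plain-Scala sketch of that predicate (`visitedOhThenCa` is a hypothetical name; in Spark you would wrap it with `udf` and apply it to the collected column):

```scala
// Does "OH" appear strictly before the first "CA" in the ordered state list?
def visitedOhThenCa(states: Seq[String]): Boolean = {
  val firstOh = states.indexOf("OH")
  val firstCa = states.indexOf("CA")
  firstOh >= 0 && firstCa >= 0 && firstOh < firstCa
}

visitedOhThenCa(Seq("OH", "OH", "CA", "OH"))  // John: true
visitedOhThenCa(Seq("CA", "CA", "OH"))        // Mike: false
```

Counting the rows where the predicate is true then gives the answer.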
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

val byName = Window.partitionBy("name").orderBy("tripid")
val distinctStates = trips
  .withColumn("rank", rank over byName)
  .dropDuplicates("name", "state")
  .orderBy("name", "rank")
scala> distinctStates.show
+------+----+-----+----+
|tripid|name|state|rank|
+------+----+-----+----+
| 1|John| OH| 1|
| 3|John| CA| 3|
| 1|Mike| CA| 1|
| 3|Mike| OH| 3|
+------+----+-----+----+
// rank again but this time use the pre-calculated distinctStates dataset
val distinctStatesRanked = distinctStates.withColumn("rank", rank over byName).orderBy("name", "rank")
scala> distinctStatesRanked.show
+------+----+-----+----+
|tripid|name|state|rank|
+------+----+-----+----+
| 1|John| OH| 1|
| 3|John| CA| 2|
| 1|Mike| CA| 1|
| 3|Mike| OH| 2|
+------+----+-----+----+
val left = distinctStatesRanked.filter($"state" === "OH").filter($"rank" === 1)
val right = distinctStatesRanked.filter($"state" === "CA").filter($"rank" === 2)
scala> left.join(right, "name").show
+----+------+-----+----+------+-----+----+
|name|tripid|state|rank|tripid|state|rank|
+----+------+-----+----+------+-----+----+
|John| 1| OH| 1| 3| CA| 2|
+----+------+-----+----+------+-----+----+
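The count the question asks for is then just `left.join(right, "name").count`, which is 1 here. For comparison, the same logic can be mirrored in plain Scala collections, runnable without Spark (a sketch; `tripRows` and the run-collapsing step are illustrative, close in spirit to the dropDuplicates step above):

```scala
// Per person: order trips by tripid, collapse runs of repeated states,
// then count people whose first two distinct states are OH then CA.
case class Trip(tripid: Int, name: String, state: String)

val tripRows = Seq(
  Trip(1, "John", "OH"), Trip(2, "John", "OH"),
  Trip(3, "John", "CA"), Trip(4, "John", "OH"),
  Trip(1, "Mike", "CA"), Trip(2, "Mike", "CA"), Trip(3, "Mike", "OH"))

val answer = tripRows
  .groupBy(_.name)              // like Window.partitionBy("name")
  .values
  .count { personTrips =>
    val states = personTrips.sortBy(_.tripid).map(_.state)
    // collapse consecutive duplicate states
    val runs = states.foldLeft(List.empty[String]) {
      case (acc, s) if acc.headOption.contains(s) => acc
      case (acc, s)                               => s :: acc
    }.reverse
    runs.take(2) == List("OH", "CA")
  }
// answer == 1 (only John)
```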