Scala / Apache Spark: updating a row in an RDD or Dataset based on another row

scala apache-spark


I'm trying to figure out how to update some rows based on another row.

For example, I have some data like:

Id | username | ratings | city
--------------------------------
1, philip, 2.0, montreal, ...
2, john, 4.0, montreal, ...
3, charles, 2.0, texas, ...

I want to update users in the same city so that they share the same groupId (either 1 or 2).

How can I achieve this with an RDD or a Dataset?

So, for completeness: if Id is a String, the dense rank will not work.

For example:

Id | username | ratings | city
--------------------------------
a, philip, 2.0, montreal, ...
b, john, 4.0, montreal, ...
c, charles, 2.0, texas, ...

The result should look like this:

grade | username | ratings | city
--------------------------------
a, philip, 2.0, montreal, ...
a, john, 4.0, montreal, ...
c, charles, 2.0, texas, ...

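Try this:

A clean way to do this is to use dense_rank() from the Window functions. It enumerates the unique values of the column the window is ordered by. Because city is a String column, the ranks increase in alphabetical order.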

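import org.apache.spark.sql.functions.dense_rank
import org.apache.spark.sql.expressions.Window
import spark.implicits._ // for the $"..." column syntax; already in scope in spark-shell

val df = spark.createDataFrame(Seq(
  (1, "philip", 2.0, "montreal"),
  (2, "john", 4.0, "montreal"),
  (3, "charles", 2.0, "texas"))).toDF("Id", "username", "rating", "city")

// Order the window by city so that rows with the same city tie on the same rank
val w = Window.orderBy($"city")
// dense_rank() leaves no gaps after ties (rank() would give 1, 1, 3 here)
df.withColumn("id", dense_rank().over(w)).show()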

+---+--------+------+--------+
| id|username|rating|    city|
+---+--------+------+--------+
|  1|  philip|   2.0|montreal|
|  1|    john|   4.0|montreal|
|  2| charles|   2.0|   texas|
+---+--------+------+--------+

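I'm concerned that this isn't distributed, but it may be fine here, so upvoted.

@mtoto Thanks for your solution, but just to ask: if id is a String, will the dense rank not work?

This approach does not take the existing id column into account; it simply assigns a unique key to each unique value of the city column.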
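
For the String Id case raised in the question, one possible sketch (not taken from the answer above) is to use the smallest Id within each city as the group key, which reproduces the a, a, c result for the sample data. The object name GroupKeys and the helper withGroupKey are hypothetical, and choosing the minimum Id is an assumption about which row should "win". Partitioning the window by city also avoids pulling all rows into a single partition.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, min}

object GroupKeys {
  // Hypothetical helper: replaces Id with "grade", the smallest Id seen in the same city.
  def withGroupKey(df: DataFrame): DataFrame = {
    // One window partition per city, so the work stays distributed across cities
    val byCity = Window.partitionBy(col("city"))
    df.select(
      min(col("Id")).over(byCity).as("grade"),
      col("username"), col("rating"), col("city"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("group-keys-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("a", "philip", 2.0, "montreal"),
      ("b", "john", 4.0, "montreal"),
      ("c", "charles", 2.0, "texas")
    ).toDF("Id", "username", "rating", "city")

    withGroupKey(df).show()
    spark.stop()
  }
}
```

Strings compare lexicographically here, so "a" is chosen over "b" for montreal. If a different representative row is wanted, the aggregate inside the window is the only part that would need to change.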

import org.apache.spark.sql.functions.monotonically_increasing_id
df.select("city").distinct.withColumn("id", monotonically_increasing_id()).join(df.drop("id"), Seq("city"))
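
Since the question also asks about RDDs, the same "one key per distinct city, then join back" idea can be sketched on an RDD. This is a minimal illustration under stated assumptions: the User case class, the CityGroupsRdd object and assignGroupIds are hypothetical names, and sorting the distinct cities before zipWithIndex is a choice made here so that the keys are deterministic and alphabetical (they start at 0).

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object CityGroupsRdd {
  // Hypothetical record type for illustration only
  final case class User(id: String, username: String, rating: Double, city: String)

  // Pairs every user with a numeric group id shared by all users of the same city.
  def assignGroupIds(users: RDD[User]): RDD[(Long, User)] = {
    // Distinct cities, sorted alphabetically, then numbered 0, 1, 2, ...
    val cityIds: RDD[(String, Long)] =
      users.map(_.city).distinct().sortBy(identity).zipWithIndex()

    // Join the per-city key back onto the users
    users.keyBy(_.city)
      .join(cityIds)
      .map { case (_, (user, groupId)) => (groupId, user) }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("city-groups-rdd-sketch")
      .getOrCreate()

    val users = spark.sparkContext.parallelize(Seq(
      User("a", "philip", 2.0, "montreal"),
      User("b", "john", 4.0, "montreal"),
      User("c", "charles", 2.0, "texas")))

    assignGroupIds(users).collect().foreach(println)
    spark.stop()
  }
}
```

Unlike monotonically_increasing_id, which only guarantees unique and increasing values, zipWithIndex yields consecutive indices, at the cost of an extra Spark job when the RDD has more than one partition.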
