Scala: Conditionally replacing column values in a DataFrame
DataFrame 1 is what I have now, and I want to write a Scala function that makes DataFrame 1 look like DataFrame 2.

Transfer is the broad category; e-Transfer and IMT are sub-categories of it.

The logic is as follows. For a given ID (31898), if both Transfer and e-Transfer are tagged to it, the result should be just e-Transfer. If Transfer, IMT and e-Transfer are all tagged to the same ID (32614), it should be e-Transfer + IMT. If only Transfer is tagged to an ID (33987), it should be Other. If only e-Transfer or only IMT is tagged to an ID (34193), it should stay e-Transfer or IMT respectively.

I am new to Scala and don't know how to write a good function to achieve this. Please help.

Tags: scala, apache-spark, dataframe, user-defined-functions
DataFrame 1                          DataFrame 2
+-------+------------+               +-------+------------------+
| ID    | Category   |               | ID    | Category         |
+-------+------------+               +-------+------------------+
| 31898 | Transfer   |               | 31898 | e-Transfer       |
| 31898 | e-Transfer |               | 32614 | e-Transfer + IMT |
| 32614 | Transfer   |    =====>     | 33987 | Other            |
| 32614 | e-Transfer |    =====>     | 34193 | e-Transfer       |
| 32614 | IMT        |               +-------+------------------+
| 33987 | Transfer   |
| 34193 | e-Transfer |
+-------+------------+
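You can group the DataFrame by ID, aggregate Category with collect_set to build an array of the distinct categories for each ID, and then use array_contains to create a new column based on what that array contains: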
import org.apache.spark.sql.functions._
// In spark-shell, spark.implicits._ is already in scope; in an application,
// import it from your SparkSession to get toDF and the $"..." syntax.

val df = Seq(
  (31898, "Transfer"),
  (31898, "e-Transfer"),
  (32614, "Transfer"),
  (32614, "e-Transfer"),
  (32614, "IMT"),
  (33987, "Transfer"),
  (34193, "e-Transfer")
).toDF("ID", "Category")

df.groupBy("ID").agg(collect_set("Category").as("CategorySet")).
  withColumn("Category",
    // e-Transfer and IMT both present (with or without Transfer)
    when(array_contains($"CategorySet", "e-Transfer") && array_contains($"CategorySet", "IMT"),
      "e-Transfer + IMT").
    // e-Transfer together with Transfer
    when(array_contains($"CategorySet", "e-Transfer") && array_contains($"CategorySet", "Transfer"),
      "e-Transfer").
    // a single sub-category on its own: keep it unchanged
    when($"CategorySet" === Array("e-Transfer") || $"CategorySet" === Array("IMT"),
      $"CategorySet"(0)).
    // Transfer on its own
    when($"CategorySet" === Array("Transfer"), "Other")
    // no otherwise(...): any other combination yields null
  ).
  show(false)
// +-----+---------------------------+----------------+
// |ID |CategorySet |Category |
// +-----+---------------------------+----------------+
// |33987|[Transfer] |Other |
// |32614|[Transfer, e-Transfer, IMT]|e-Transfer + IMT|
// |34193|[e-Transfer] |e-Transfer |
// |31898|[Transfer, e-Transfer] |e-Transfer |
// +-----+---------------------------+----------------+
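Your sample data may not cover every case (for example, [Transfer, IMT]). The sample code above produces a null Category value for any case it does not match. If you identify other cases, simply modify or extend the condition checks.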
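
To see which combinations fall through to null, it can help to write the same rules as a plain Scala function and try it on small sets without starting Spark. The following is only a minimal sketch: the names CategoryRules, categorize and sampleResult are hypothetical (they are not part of the answer above or of the Spark API), and None stands in for the null that the DataFrame version would produce.

```scala
// Hypothetical helper, not part of Spark: mirrors the when(...) chain above
// on a plain Scala Set so the rules can be checked without a SparkSession.
object CategoryRules {

  // Returns None where the DataFrame version would yield null.
  def categorize(cats: Set[String]): Option[String] =
    if (cats.contains("e-Transfer") && cats.contains("IMT")) Some("e-Transfer + IMT")
    else if (cats.contains("e-Transfer") && cats.contains("Transfer")) Some("e-Transfer")
    else if (cats == Set("e-Transfer") || cats == Set("IMT")) Some(cats.head)
    else if (cats == Set("Transfer")) Some("Other")
    else None // e.g. Set("Transfer", "IMT"), which the question does not specify

  // The sample rows from the question, grouped by ID like groupBy + collect_set.
  val sampleRows: Seq[(Int, String)] = Seq(
    (31898, "Transfer"), (31898, "e-Transfer"),
    (32614, "Transfer"), (32614, "e-Transfer"), (32614, "IMT"),
    (33987, "Transfer"),
    (34193, "e-Transfer")
  )

  val sampleResult: Map[Int, Option[String]] =
    sampleRows.groupBy(_._1).map { case (id, rows) =>
      id -> categorize(rows.map(_._2).toSet)
    }
}
```

If this shape turns out to be useful, one option would be to wrap it in a UDF over the CategorySet column, for example udf((cats: Seq[String]) => CategoryRules.categorize(cats.toSet).orNull). The built-in when/array_contains expressions are usually preferable, though, since Spark can optimize them and cannot see inside a UDF.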