Scala Spark: transform a dataframe and get all unique ids and their types from each row
I have a dataframe like this:
+---------+---------+-------------+-------+
|device_id|master_id|time |user_id|
+---------+---------+-------------+-------+
|X |M |1604609299000|A |
|Z |M |1604609318000|A |
|Y |N |1604610161000|B |
+---------+---------+-------------+-------+
What I am trying to do is get every unique id together with its type and the time I first saw it. How can I transform the dataframe above into something like this:
+---+-------------+---------+
|id |time |type |
+---+-------------+---------+
|A |1604609299000|user_id |
|X |1604609299000|device_id|
|M |1604609299000|master_id|
|Z |1604609318000|device_id|
|B |1604610161000|user_id |
|Y |1604610161000|device_id|
|N |1604610161000|master_id|
+---+-------------+---------+
How big is your dataset? This answer may not be the most efficient, but it will work. Assuming your original dataframe is named someDF:
someDF.createOrReplaceTempView("someDF")
someDF.printSchema
root
|-- device_id: string (nullable = true)
|-- master_id: string (nullable = true)
|-- time: integer (nullable = false)
|-- user_id: string (nullable = true)
someDF.show
+---------+---------+----------+-------+
|device_id|master_id| time|user_id|
+---------+---------+----------+-------+
| X| M|1604609299| A|
| Z| M|1604609318| A|
| Y| N|1604610161| B|
+---------+---------+----------+-------+
The transformation query would look something like this:
import org.apache.spark.sql.functions.col

spark.sql("""select device_id as id, min(time) as time, 'device_id' as type from someDF group by device_id
union
select master_id as id, min(time) as time, 'master_id' as type from someDF group by master_id
union
select user_id as id, min(time) as time, 'user_id' as type from someDF group by user_id
""").sort(col("time"))
res13: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string, time: int ... 1 more field]
res13.show
+---+----------+---------+
| id| time| type|
+---+----------+---------+
| X|1604609299|device_id|
| A|1604609299| user_id|
| M|1604609299|master_id|
| Z|1604609318|device_id|
| Y|1604610161|device_id|
| N|1604610161|master_id|
| B|1604610161| user_id|
+---+----------+---------+
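The logic of the three grouped selects (group each id column by its value, keep the minimum time, tag it with the column name, then union) can be checked with plain Scala collections. This is only a model of the query, not Spark API; the `Rec` case class and the sample rows are stand-ins for the dataframe:

```scala
// One record per source row, mirroring the dataframe columns.
case class Rec(deviceId: String, masterId: String, time: Long, userId: String)

val rows = Seq(
  Rec("X", "M", 1604609299000L, "A"),
  Rec("Z", "M", 1604609318000L, "A"),
  Rec("Y", "N", 1604610161000L, "B")
)

// For one id column: group by the id value and keep the minimum time,
// mirroring `select <col> as id, min(time), '<col>' as type ... group by <col>`.
def minTimes(pairs: Seq[(String, Long)], tpe: String): Seq[(String, Long, String)] =
  pairs.groupBy(_._1).map { case (id, ps) => (id, ps.map(_._2).min, tpe) }.toSeq

// Union the three per-column results, as the SQL UNION does, then sort by time.
val result = (
  minTimes(rows.map(r => (r.deviceId, r.time)), "device_id") ++
  minTimes(rows.map(r => (r.masterId, r.time)), "master_id") ++
  minTimes(rows.map(r => (r.userId, r.time)), "user_id")
).sortBy(_._2)
```

Each id appears exactly once, with the earliest time it was seen, matching the table asked for in the question.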
Add the required columns inside array(struct()), then explode the column data. Check the code below. Given data:
scala> df.show(false)
+---------+---------+-------------+-------+
|device_id|master_id|time |user_id|
+---------+---------+-------------+-------+
|X |M |1604609299000|A |
|Z |M |1604609318000|A |
|Y |N |1604610161000|B |
+---------+---------+-------------+-------+
Create the expression:
scala> import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

scala> val colExpr = array(
         df
           .columns
           .filterNot(_ == "time")      // every column except time becomes an id
           .map(c =>
             struct(
               col(c).as("id"),         // id column
               col("time").as("time"),  // time column
               lit(c).as("type")        // type column
             )
           ): _*
       )
The result of the code above looks like this:
colExpr: org.apache.spark.sql.Column = array(
named_struct(NamePlaceholder(), device_id AS `id`, NamePlaceholder(), time AS `time`, type, device_id AS `type`),
named_struct(NamePlaceholder(), master_id AS `id`, NamePlaceholder(), time AS `time`, type, master_id AS `type`),
named_struct(NamePlaceholder(), user_id AS `id`, NamePlaceholder(), time AS `time`, type, user_id AS `type`)
)
Note: the expression above is printed only to show how the code translates into a Column expression; do not try to execute it as code.
Apply the expression:
scala> df.select(explode(colExpr).as("data")).select("data.*").show(false)
+---+-------------+---------+
|id |time |type |
+---+-------------+---------+
|X |1604609299000|device_id|
|M |1604609299000|master_id|
|A |1604609299000|user_id |
|Z |1604609318000|device_id|
|M |1604609318000|master_id|
|A |1604609318000|user_id |
|Y |1604610161000|device_id|
|N |1604610161000|master_id|
|B |1604610161000|user_id |
+---+-------------+---------+
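Note that this exploded output still has one row per (source row, column) pair, so M and A appear more than once, whereas the table in the question keeps only the earliest time per id. A follow-up aggregation such as `groupBy("id", "type").agg(min("time"))` would collapse those duplicates. The flatten-then-reduce step, sketched with plain Scala collections rather than Spark (the sample data below is the exploded output above, hard-coded for illustration):

```scala
// Exploded rows: (id, time, type), one per source row and id column.
val exploded = Seq(
  ("X", 1604609299000L, "device_id"), ("M", 1604609299000L, "master_id"),
  ("A", 1604609299000L, "user_id"),   ("Z", 1604609318000L, "device_id"),
  ("M", 1604609318000L, "master_id"), ("A", 1604609318000L, "user_id"),
  ("Y", 1604610161000L, "device_id"), ("N", 1604610161000L, "master_id"),
  ("B", 1604610161000L, "user_id")
)

// Keep the minimum time per (id, type), analogous to
// groupBy("id", "type").agg(min("time")) on the exploded dataframe.
val firstSeen = exploded
  .groupBy(t => (t._1, t._3))
  .map { case ((id, tpe), ts) => (id, ts.map(_._2).min, tpe) }
  .toSeq
  .sortBy(_._2)
```

After this step each id is listed once with the first time it was seen, matching the desired output.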
I think you can create three separate selects, one per type, and then union the results.
There are about 400k rows. The solution provided by Srinivas seems more efficient.
Make sure you run the solution provided by Srinivas on unsorted data; it may give you the first occurrence in the table rather than the smallest timestamp.
Could you elaborate on colExpr? My Scala compiler doesn't seem to like your syntax.