
Scala Spark: transform a DataFrame and get all unique IDs and their types from each row


I have a DataFrame like this:

+---------+---------+-------------+-------+
|device_id|master_id|time         |user_id|
+---------+---------+-------------+-------+
|X        |M        |1604609299000|A      |
|Z        |M        |1604609318000|A      |
|Y        |N        |1604610161000|B      |
+---------+---------+-------------+-------+
What I am trying to do is get every unique ID and its type, together with the time at which I first saw it. How can I transform the DataFrame above into something like this:

+---+-------------+---------+
|id |time         |type     |
+---+-------------+---------+
|A  |1604609299000|user_id  |
|X  |1604609299000|device_id|
|M  |1604609299000|master_id|
|Z  |1604609318000|device_id|
|B  |1604610161000|user_id  |
|Y  |1604610161000|device_id|
|N  |1604610161000|master_id|
+---+-------------+---------+

How big is your dataset? This answer may not be the most efficient, but it will work.

Assuming your original DataFrame is named someDF:
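If you want to reproduce this locally, here is a minimal sketch for building such a DataFrame (hypothetical test data; time is stored as integer seconds here, matching the schema printed below):

import spark.implicits._

// Hypothetical sample rows matching the schema below (time as integer seconds).
val someDF = Seq(
  ("X", "M", 1604609299, "A"),
  ("Z", "M", 1604609318, "A"),
  ("Y", "N", 1604610161, "B")
).toDF("device_id", "master_id", "time", "user_id")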

someDF.createOrReplaceTempView("someDF")

someDF.printSchema
    root
     |-- device_id: string (nullable = true)
     |-- master_id: string (nullable = true)
     |-- time: integer (nullable = false)
     |-- user_id: string (nullable = true)

someDF.show
    +---------+---------+----------+-------+
    |device_id|master_id|      time|user_id|
    +---------+---------+----------+-------+
    |        X|        M|1604609299|      A|
    |        Z|        M|1604609318|      A|
    |        Y|        N|1604610161|      B|
    +---------+---------+----------+-------+
The transformation query would look something like this:

spark.sql("""select device_id as id ,min(time) as time,'device_id' as type from someDF group by device_id
union
select master_id as id, min(time) as time , 'master_id' as type from someDF group by master_id
union
select user_id as id,  min(time) as time , 'user_id' as type from someDF group by user_id
""").sort(col("time"))

res13: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string, time: int ... 1 more field]

res13.show
    +---+----------+---------+
    | id|      time|     type|
    +---+----------+---------+
    |  X|1604609299|device_id|
    |  A|1604609299|  user_id|
    |  M|1604609299|master_id|
    |  Z|1604609318|device_id|
    |  Y|1604610161|device_id|
    |  N|1604610161|master_id|
    |  B|1604610161|  user_id|
    +---+----------+---------+
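The same union of three aggregations can also be sketched with the DataFrame API instead of SQL; a rough equivalent is below (the helper name firstSeenBy is made up for illustration):

import org.apache.spark.sql.functions.{col, lit, min}

// One aggregation per id column, tagged with the column name, then unioned.
def firstSeenBy(idCol: String) =
  someDF.groupBy(col(idCol).as("id"))
    .agg(min("time").as("time"))
    .withColumn("type", lit(idCol))

val result = firstSeenBy("device_id")
  .union(firstSeenBy("master_id"))
  .union(firstSeenBy("user_id"))
  .sort("time")

result.show()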

Add the required columns inside array(struct(...)), then explode that column.

Check the code below.

Given data:

scala> df.show(false)
+---------+---------+-------------+-------+
|device_id|master_id|time         |user_id|
+---------+---------+-------------+-------+
|X        |M        |1604609299000|A      |
|Z        |M        |1604609318000|A      |
|Y        |N        |1604610161000|B      |
+---------+---------+-------------+-------+
Create the expression:

scala> val colExpr = array(
        df
        .columns
        .filterNot(_ == "time")
        .map(c => 
                struct(
                    col(c).as("id"), // id column
                    col("time").as("time"), // time column
                    lit(c).as("type") // type column
                )
            ):_*
    )
The result of the code above looks like this:

colExpr: org.apache.spark.sql.Column = array(
    named_struct(NamePlaceholder(), device_id AS `id`, NamePlaceholder(), time AS `time`, type, device_id AS `type`), 
    named_struct(NamePlaceholder(), master_id AS `id`, NamePlaceholder(), time AS `time`, type, master_id AS `type`), 
    named_struct(NamePlaceholder(), user_id AS `id`, NamePlaceholder(), time AS `time`, type, user_id AS `type`)
)
Note: the expression shown above is only meant to illustrate how the code is converted into an expression. (Do not execute it directly.)

Apply the expression:

scala> df.select(explode(colExpr).as("data")).select("data.*").show(false)
+---+-------------+---------+
|id |time         |type     |
+---+-------------+---------+
|X  |1604609299000|device_id|
|M  |1604609299000|master_id|
|A  |1604609299000|user_id  |
|Z  |1604609318000|device_id|
|M  |1604609318000|master_id|
|A  |1604609318000|user_id  |
|Y  |1604610161000|device_id|
|N  |1604610161000|master_id|
|B  |1604610161000|user_id  |
+---+-------------+---------+
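Note that this exploded output still repeats ids that appear in several rows (M and A each show up more than once), while the question asks for each id only at its first occurrence. A rough sketch for collapsing it, reusing colExpr from above and keeping the earliest time per (id, type):

import org.apache.spark.sql.functions.{explode, min}

// Explode as above, then keep only the earliest time per (id, type).
val exploded = df.select(explode(colExpr).as("data")).select("data.*")
val firstSeen = exploded
  .groupBy("id", "type")
  .agg(min("time").as("time"))
  .select("id", "time", "type")

firstSeen.orderBy("time").show(false)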

I think you can create three separate selects, one per type, in a single query and then union the results. There are about 400k rows.
The solution provided by Srinivas looks more efficient.
Be careful running Srinivas's solution on unsorted data: it may give you the first occurrence in the table rather than the smallest timestamp.
Could you elaborate on colExpr? My Scala compiler does not seem to like the syntax.
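On that colExpr compile error: array, struct, col and lit all live in org.apache.spark.sql.functions, so missing imports are a likely cause. A self-contained version of the same expression, assuming the same df as in the answer above:

import org.apache.spark.sql.functions.{array, col, lit, struct}

// One struct per non-time column: the column's value as `id`,
// the row's time, and the column name (as a string literal) as `type`.
val colExpr = array(
  df.columns
    .filterNot(_ == "time")
    .map(c => struct(col(c).as("id"), col("time").as("time"), lit(c).as("type"))): _*
)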