Efficient way to pivot columns and group by in a PySpark DataFrame

I have a DataFrame in PySpark, as shown below:

df = spark.createDataFrame([(1,'ios',11,'null'),
                            (1,'ios',12,'null'),
                            (1,'ios',13,'null'),
                            (1,'ios',14,'null'),
                            (1,'android',15,'ok'),
                            (1,'android',16,'not ok'),
                            (1,'android',17,'aborted'),
                            (2,'ios',21,'not ok'),
                            (2,'android',18,'aborted'),
                            (3,'android',18,'null')],
                           ['id','type','s_id','state'])

df.show()
+---+-------+----+-------+
| id|   type|s_id|  state|
+---+-------+----+-------+
|  1|    ios|  11|   null|
|  1|    ios|  12|   null|
|  1|    ios|  13|   null|
|  1|    ios|  14|   null|
|  1|android|  15|     ok|
|  1|android|  16| not ok|
|  1|android|  17|aborted|
|  2|    ios|  21| not ok|
|  2|android|  18|aborted|
|  3|android|  18|   null|
+---+-------+----+-------+

Now, starting from this DataFrame, I want to create another DataFrame by pivoting it.
I did the following:

from pyspark.sql import Window
from pyspark.sql import functions as f

# row_number() requires an ordered window, so order by s_id within each (id, type)
windowSpec = Window.partitionBy("id", "type").orderBy("s_id")

df1 = df.withColumn("ranks", f.row_number().over(windowSpec))\
        .filter(f.col("ranks") < 4)\
        .filter(f.col("type") != "")\
        .withColumn("type", f.concat(f.col("type"), 
                    f.col("ranks"))).drop("ranks")\
        .groupBy("id").pivot("type").agg(f.first("s_id"))


df1.show()
+---+--------+--------+--------+----+----+----+
| id|android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|  1|      15|      16|      17|  11|  12|  13|
|  2|      18|    null|    null|  21|null|null|
|  3|      18|    null|    null|null|null|null|
+---+--------+--------+--------+----+----+----+

Then I join df1 with df2, which holds one `first` state value per id:

final_df = df1.join(df2, 'id', 'left_outer')

final_df.show()

+---+--------+--------+--------+----+----+----+------+
| id|android1|android2|android3|ios1|ios2|ios3| first|
+---+--------+--------+--------+----+----+----+------+
|  1|      15|      16|      17|  11|  12|  13|    ok|
|  2|      18|    null|    null|  21|null|null|not ok|
|  3|      18|    null|    null|null|null|null|  null|
+---+--------+--------+--------+----+----+----+------+

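I get the result I want, but I would like to know whether there is a more efficient way to do this.

Perhaps this is a somewhat more efficient way: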

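from pyspark.sql import Window
from pyspark.sql import functions as F

# Compute the order of appearance of each type within an id
# (rank() gives ties the same number; s_id is unique per (id, type) here)
w = Window.partitionBy('id','type').orderBy('s_id')
df = df.withColumn('order',F.rank().over(w))

# Concatenate the type and its order into a single column
df = df.withColumn('type',F.concat(F.col('type'),
                                   F.col('order'))).drop('order')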
df.show()

+---+--------+----+-------+
| id|    type|s_id|  state|
+---+--------+----+-------+
|  1|    ios1|  11|   null|
|  1|    ios2|  12|   null|
|  1|    ios3|  13|   null|
|  1|    ios4|  14|   null|
|  3|android1|  18|   null|
|  2|    ios1|  21| not ok|
|  2|android1|  18|aborted|
|  1|android1|  15|     ok|
|  1|android2|  16| not ok|
|  1|android3|  17|aborted|
+---+--------+----+-------+

Then pivot the DataFrame and keep only the first 3 columns for each OS type:

# Choose the number of columns you want per type
n_type = 3
l_col = ['android'+str(i+1) for i in range(n_type)] + ['ios'+str(i+1) for i in range(n_type)]

# Keep 'id' and select the pivoted columns in the order shown in the output
df = df.groupBy('id').pivot('type').agg({'s_id':'max'}).orderBy('id').select('id', *l_col)
df.show()

+---+--------+--------+--------+----+----+----+
| id|android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|  1|      15|      16|      17|  11|  12|  13|
|  2|      18|    null|    null|  21|null|null|
|  3|      18|    null|    null|null|null|null|
+---+--------+--------+--------+----+----+----+

Then join and add the last column using your own approach.

Edit: I added a list of columns so that only the required columns are selected.

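df = df.groupBy('id').pivot('type').agg({'s_id': 'max'}).orderBy('id').drop('ios4', 'android4')

With this statement, if I had to drop more than 10 columns, typing the column names manually would be tedious.

I just edited the answer to provide a way to customize which columns are selected.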
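If the `drop`-based form is preferred, the names to drop can be computed instead of typed. This is a small plain-Python sketch. The helpers `pivot_column_names` and `columns_to_drop` are hypothetical names, not part of PySpark, and `pivoted_columns` stands in for `df.columns` after an unrestricted pivot of the sample data.

```python
def pivot_column_names(types, n):
    """Build names like 'android1' ... 'android<n>' for every type."""
    return [f"{t}{i}" for t in types for i in range(1, n + 1)]


def columns_to_drop(all_columns, keep):
    """Return the entries of all_columns that are not in keep, in order."""
    keep = set(keep)
    return [c for c in all_columns if c not in keep]


# Stand-in for df.columns after pivoting the sample data without restrictions
pivoted_columns = ['id', 'android1', 'android2', 'android3',
                   'ios1', 'ios2', 'ios3', 'ios4']

wanted = ['id'] + pivot_column_names(['android', 'ios'], 3)
extra = columns_to_drop(pivoted_columns, wanted)
print(extra)  # ['ios4']
```

The resulting list could then be passed on as `df.drop(*extra)`.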
