Apache Spark: efficient way to pivot and group columns in a pyspark dataframe

I have a dataframe in pyspark as below:
df = spark.createDataFrame([(1,'ios',11,'null'),
(1,'ios',12,'null'),
(1,'ios',13,'null'),
(1,'ios',14,'null'),
(1,'android',15,'ok'),
(1,'android',16,'not ok'),
(1,'android',17,'aborted'),
(2,'ios',21,'not ok'),
(2,'android',18,'aborted'),
(3,'android',18,'null')],
['id','type','s_id','state'])
df.show()
+---+-------+----+-------+
| id| type|s_id| state|
+---+-------+----+-------+
| 1| ios| 11| null|
| 1| ios| 12| null|
| 1| ios| 13| null|
| 1| ios| 14| null|
| 1|android| 15| ok|
|  1|android|  16| not ok|
|  1|android|  17|aborted|
|  2|    ios|  21| not ok|
| 2|android| 18|aborted|
| 3|android| 18| null|
+---+-------+----+-------+
Now, starting from this dataframe, I want to create another dataframe by pivoting it.
I did the following:
from pyspark.sql import Window
from pyspark.sql import functions as f

# row_number() requires an ordered window; order by s_id within each (id, type)
windowSpec = Window.partitionBy("id", "type").orderBy("s_id")
df1 = df.withColumn("ranks", f.row_number().over(windowSpec))\
    .filter(f.col("ranks") < 4)\
    .filter(f.col("type") != "")\
    .withColumn("type", f.concat(f.col("type"), f.col("ranks")))\
    .drop("ranks")\
    .groupBy("id").pivot("type").agg(f.first("s_id"))
df1.show()
+---+--------+--------+--------+----+----+----+
| id|android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
| 1| 15| 16| 17| 11| 12| 13|
| 2| 18| null| null| 21|null|null|
| 3| 18| null| null|null|null|null|
+---+--------+--------+--------+----+----+----+
Then join df1 with df2 (not shown above), which holds the first state per id:
final_df = df1.join(df2, 'id', 'left_outer')
final_df.show()
+---+--------+--------+--------+----+----+----+------+
| id|android1|android2|android3|ios1|ios2|ios3| first|
+---+--------+--------+--------+----+----+----+------+
| 1| 15| 16| 17| 11| 12| 13| ok|
|  2|      18|    null|    null|  21|null|null|not ok|
| 3| 18| null| null|null|null|null| null|
+---+--------+--------+--------+----+----+----+------+
I got what I wanted, but I would like to know whether there is a more efficient way to do this.

Perhaps; here is a somewhat more efficient approach:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Compute the order of appearance of each type within an id
w = Window.partitionBy('id', 'type').orderBy('s_id')
df = df.withColumn('order', F.rank().over(w))
# Concatenate the type with its order (ios1, ios2, ...)
df = df.withColumn('type', F.concat(F.col('type'),
                                    F.col('order'))).drop('order')
df.show()
+---+--------+----+-------+
| id| type|s_id| state|
+---+--------+----+-------+
| 1| ios1| 11| null|
| 1| ios2| 12| null|
| 1| ios3| 13| null|
| 1| ios4| 14| null|
| 3|android1| 18| null|
| 2| ios1| 21| not ok|
| 2|android1| 18|aborted|
| 1|android1| 15| ok|
| 1|android2| 16| not ok|
| 1|android3| 17|aborted|
+---+--------+----+-------+
Then pivot the dataframe, keeping only the first 3 columns for each OS type:
# Choose the number of columns you want per type
n_type = 3
l_col = ['ios' + str(i + 1) for i in range(n_type)] + ['android' + str(i + 1) for i in range(n_type)]
# Keep 'id' in the selection so the join key survives the select
df = df.groupBy('id').pivot('type').agg({'s_id': 'max'}).orderBy('id').select('id', *l_col)
df.show()
+---+--------+--------+--------+----+----+----+
| id|android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
| 1| 15| 16| 17| 11| 12| 13|
| 2| 18| null| null| 21|null|null|
| 3| 18| null| null|null|null|null|
+---+--------+--------+--------+----+----+----+
Then join and add the last column using your approach.
EDIT: I added a column list to select only the desired columns.
df = df.groupBy('id').pivot('type').agg({'s_id': 'max'}).orderBy('id').drop('ios4', 'android4')
With this statement, if I have to drop more than 10 columns, typing the column names manually becomes tedious.

I just edited the answer to provide a way to customize which columns to select.