Pivoting a DataFrame in Apache Spark / PySpark
I have a requirement as below. Input DataFrame:
id code
R101,GTR001
R201,RTY987
R301,KIT158
R201,PLI564
R101,MJU098
R301,OUY579
Each id can have many codes, not just two.
The expected output should look like this:
id col1 col2 col3 col4 col5 col6
R101 GTR001 MJU098 null null null null
R201 null null RTY987 PLI564 null null
R301 null null null null KIT158 OUY579
Here, the columns populated for a particular id depend on how many codes are assigned to that id: R101's codes should fill col1 and col2, R201's codes should fill col3 and col4, and so on for the remaining ids.

You can try ranking the code field ordered by id and pivoting on that rank. Hope this helps.
df = spark.createDataFrame(
    [('R101', 'GTR001'), ('R201', 'RTY987'), ('R301', 'KIT158'),
     ('R201', 'PLI564'), ('R101', 'MJU098'), ('R301', 'OUY579')],
    ['id', 'code'])
df.show()
+----+------+
| id| code|
+----+------+
|R101|GTR001|
|R201|RTY987|
|R301|KIT158|
|R201|PLI564|
|R101|MJU098|
|R301|OUY579|
+----+------+
from pyspark.sql import functions as F
from pyspark.sql import Window

# Global dense rank over (id, code). Note the Window has no partitionBy,
# so Spark will warn that all data moves to a single partition.
df = df.withColumn('rank', F.dense_rank().over(Window.orderBy("id", "code")))

# Turn the rank into a column label, then pivot: one output column per
# rank, holding the code of the single row that carries that rank.
df.withColumn('combcol', F.concat(F.lit('col_'), df['rank'])) \
  .groupby('id').pivot('combcol').agg(F.first('code')).show()
+----+------+------+------+------+------+------+
| id| col_1| col_2| col_3| col_4| col_5| col_6|
+----+------+------+------+------+------+------+
|R101|GTR001|MJU098| null| null| null| null|
|R201| null| null|PLI564|RTY987| null| null|
|R301| null| null| null| null|KIT158|OUY579|
+----+------+------+------+------+------+------+
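The rank-then-pivot logic can be sanity-checked without a Spark cluster. The sketch below is a pure-Python re-implementation of the same idea on the same sample data (the `rows`, `ranked`, and `pivoted` names are illustrative, not part of the answer's PySpark code):

```python
# Pure-Python sketch of the answer's rank-then-pivot logic, for
# sanity-checking the expected layout without Spark.
rows = [('R101', 'GTR001'), ('R201', 'RTY987'), ('R301', 'KIT158'),
        ('R201', 'PLI564'), ('R101', 'MJU098'), ('R301', 'OUY579')]

# Equivalent of dense_rank over Window.orderBy("id", "code"):
# sort the distinct (id, code) pairs and number them 1..n.
ranked = {pair: i + 1 for i, pair in enumerate(sorted(rows))}

# Equivalent of the pivot: one dict per id, mapping 'col_<rank>' -> code.
pivoted = {}
for id_, code in rows:
    pivoted.setdefault(id_, {})['col_%d' % ranked[(id_, code)]] = code

for id_ in sorted(pivoted):
    print(id_, pivoted[id_])
```

As in the Spark output, R201 ends up with col_3 = PLI564 and col_4 = RTY987, because the rank orders by code within each id; the asker's expected row listed RTY987 before PLI564, which assumed insertion order instead.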
Comments: Possible duplicate. Can the same code (e.g. GTR001) be assigned to two or more ids? — No Jacek, the codes are unique.
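Since the answer relies on the uniqueness claim from the comments, a duplicate check is cheap to add up front. A minimal pure-Python sketch on the same sample data (in Spark one could instead inspect `df.groupBy('code').count()` for counts above 1):

```python
from collections import Counter

rows = [('R101', 'GTR001'), ('R201', 'RTY987'), ('R301', 'KIT158'),
        ('R201', 'PLI564'), ('R101', 'MJU098'), ('R301', 'OUY579')]

# Codes appearing more than once; the pivot assumes this list is empty.
dupes = [c for c, n in Counter(code for _, code in rows).items() if n > 1]
print(dupes)  # → []
```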