Apache Spark: pivoting a DataFrame in PySpark


I have the following requirement.

DataFrame:

id    code
R101  GTR001
R201  RTY987
R301  KIT158
R201  PLI564
R101  MJU098
R301  OUY579
Each id can have many codes, not just two.

The expected output should look like this:

id   col1   col2   col3   col4   col5   col6
R101 GTR001 MJU098 null   null   null   null
R201 null   null   RTY987 PLI564 null   null
R301 null   null   null   null   KIT158 OUY579

Here, which columns a given id fills depends on the number of codes assigned to each id: R101's codes should go under col1 and col2, R201's codes under col3 and col4, and likewise for the remaining ids.

You can try ranking the code field by id and pivoting on that rank. Hope this helps.

 df = spark.createDataFrame(
     [('R101','GTR001'), ('R201','RTY987'), ('R301','KIT158'),
      ('R201','PLI564'), ('R101','MJU098'), ('R301','OUY579')],
     ['id', 'code'])
 df.show()
   +----+------+
   |  id|  code|
   +----+------+
   |R101|GTR001|
   |R201|RTY987|
   |R301|KIT158|
   |R201|PLI564|
   |R101|MJU098|
   |R301|OUY579|
   +----+------+

 from pyspark.sql import functions as F
 from pyspark.sql import Window

 # A global (unpartitioned) window ranks all (id, code) pairs 1..6,
 # which is what spreads each id's codes into its own column range.
 df = df.withColumn('rank', F.dense_rank().over(Window.orderBy('id', 'code')))
 df = df.withColumn('combcol', F.concat(F.lit('col_'), df['rank']))
 df.groupby('id').pivot('combcol').agg(F.first('code')).show()

   +----+------+------+------+------+------+------+
   |  id| col_1| col_2| col_3| col_4| col_5| col_6|
   +----+------+------+------+------+------+------+
   |R101|GTR001|MJU098|  null|  null|  null|  null|
   |R201|  null|  null|PLI564|RTY987|  null|  null|
   |R301|  null|  null|  null|  null|KIT158|OUY579|
   +----+------+------+------+------+------+------+ 
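
A note on this approach: Window.orderBy without a partitionBy computes the rank over a single partition, so it is best kept to small data. If you instead want each id's codes packed into the same compact set of columns, here is a minimal sketch of a per-id variant (row_number over a window partitioned by id; the names w and df2 are mine, not from the answer above):

 from pyspark.sql import functions as F
 from pyspark.sql import Window

 # Rank codes within each id instead of globally; partitionBy keeps the
 # work distributed instead of funnelling all rows into one partition.
 w = Window.partitionBy('id').orderBy('code')
 df2 = df.withColumn('rank', F.row_number().over(w))
 df2 = df2.withColumn('combcol', F.concat(F.lit('col_'), df2['rank']))
 df2.groupby('id').pivot('combcol').agg(F.first('code')).show()

which should give a dense layout along these lines:

   +----+------+------+
   |  id| col_1| col_2|
   +----+------+------+
   |R101|GTR001|MJU098|
   |R201|PLI564|RTY987|
   |R301|KIT158|OUY579|
   +----+------+------+

Whether the sparse global layout or this compact one is right depends on the requirement; the question explicitly asks for the sparse version shown above.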

Possible duplicate. Can the same code (e.g. GTR001) be assigned to two or more ids? No Jacek, codes are unique.