Apache Spark / PySpark: agg function to "explode" rows into columns
Basically, I have a dataframe like this:
+----+-------+------+------+
| id | index | col1 | col2 |
+----+-------+------+------+
| 1 | a | a11 | a12 |
+----+-------+------+------+
| 1 | b | b11 | b12 |
+----+-------+------+------+
| 2 | a | a21 | a22 |
+----+-------+------+------+
| 2 | b | b21 | b22 |
+----+-------+------+------+
The result I want is:
+----+--------+--------+--------+--------+
| id | col1_a | col1_b | col2_a | col2_b |
+----+--------+--------+--------+--------+
| 1 | a11 | b11 | a12 | b12 |
+----+--------+--------+--------+--------+
| 2 | a21 | b21 | a22 | b22 |
+----+--------+--------+--------+--------+
So basically I want to "explode" the index column into new columns after a groupBy on id. By the way, the row count per id is the same, and every id has the same set of index values. I'm using PySpark.
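For reference, the reshape being asked for amounts to grouping rows by id and fanning each (column, index) pair out into its own wide column. A minimal plain-Python sketch of that logic (illustration only; the sample rows mirror the table above):

```python
# Plain-Python sketch of the requested reshape: group by id, then turn every
# (value-column, index) pair into its own wide column, e.g. col1_a.
from collections import defaultdict

rows = [
    {"id": 1, "index": "a", "col1": "a11", "col2": "a12"},
    {"id": 1, "index": "b", "col1": "b11", "col2": "b12"},
    {"id": 2, "index": "a", "col1": "a21", "col2": "a22"},
    {"id": 2, "index": "b", "col1": "b21", "col2": "b22"},
]

wide = defaultdict(dict)
for r in rows:
    for c in ("col1", "col2"):
        wide[r["id"]][f"{c}_{r['index']}"] = r[c]

result = [{"id": k, **v} for k, v in sorted(wide.items())]
# result[0] → {"id": 1, "col1_a": "a11", "col2_a": "a12",
#              "col1_b": "b11", "col2_b": "b12"}
```

This is exactly what pivot does for you in Spark, as the answer below shows.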
The desired output can be achieved using pivot:
from pyspark.sql import functions as F

df3 = df.groupBy("id").pivot("index").agg(F.first(F.col("col1")), F.first(F.col("col2")))

# Rename the columns (toDF renames positionally)
collist = ["id", "col1_a", "col2_a", "col1_b", "col2_b"]
df3.toDF(*collist).show()
+---+------+------+------+------+
| id|col1_a|col2_a|col1_b|col2_b|
+---+------+------+------+------+
| 1| a11| a12| b11| b12|
| 2| a21| a22| b21| b22|
+---+------+------+------+------+
Note: rearrange the columns to suit your requirement. — Do you want to hardcode the column names col1_a and col1_b, or should they be dynamic, depending on the distinct values of index? — They should depend on the distinct values of index.