PySpark: pivot all rows

I have a DataFrame that I want to pivot, and I want to pivot all rows, not just the max, min, or first:

df = spark.createDataFrame(
    [
        ('sequence', '@short_read_5/1', '2aa897b7-38cd-4af...', '2020-10-08-12'),
        ('desc', '+', '2aa897b7-38cd-4af...', '2020-10-08-12'),
        ('raw_sequence', 'GTTACTTCGATATCCGC...', '2aa897b7-38cd-4af...', '2020-10-08-12'),
        ('quality', 'BCCC<EGAGGGGGGGGG...', '2aa897b7-38cd-4af...', '2020-10-08-12'),
        ('sequence', '@short_read_5/2', '2aa897b7-38cd-4af...', '2020-10-08-12'),
        ('desc', '+', '2aa897b7-38cd-4af...', '2020-10-08-12'),
        ('raw_sequence', 'dTTACTTCGATATCCGC...', '2aa897b7-38cd-4af...', '2020-10-08-12'),
        ('quality', 'dCCC<EGAGGGGGGGGG...', '2aa897b7-38cd-4af...', '2020-10-08-12'),
        ],
    ['col', 'data', 'metadata_key', 'datehour']
    )

+------------+--------------------+--------------------+-------------+
|         col|                data|        metadata_key|     datehour|
+------------+--------------------+--------------------+-------------+
|    sequence|     @short_read_5/1|2aa897b7-38cd-4af...|2020-10-08-12|
|        desc|                   +|2aa897b7-38cd-4af...|2020-10-08-12|
|raw_sequence|GTTACTTCGATATCCGC...|2aa897b7-38cd-4af...|2020-10-08-12|
|     quality|BCCC<EGAGGGGGGGGG...|2aa897b7-38cd-4af...|2020-10-08-12|
|    sequence|     @short_read_5/2|2aa897b7-38cd-4af...|2020-10-08-12|
|        desc|                   +|2aa897b7-38cd-4af...|2020-10-08-12|
|raw_sequence|dTTACTTCGATATCCGC...|2aa897b7-38cd-4af...|2020-10-08-12|
|     quality|dCCC<EGAGGGGGGGGG...|2aa897b7-38cd-4af...|2020-10-08-12|
+------------+--------------------+--------------------+-------------+

Comment: Why should dTTACTTCGATATCCGC... and @short_read_5/2 end up on the same row? What is the logic? If you cannot answer that simple question, that is the reason you end up with only one row.

Here is what I tried, which returns only one row:

from pyspark.sql import functions as F

df.groupBy('metadata_key', 'datehour').pivot("col", ["sequence", "desc", "raw_sequence", "quality"]).agg(F.first(F.col("data"))).show()

+--------------------+-------------+---------------+----+--------------------+--------------------+
|        metadata_key|     datehour|       sequence|desc|        raw_sequence|             quality|
+--------------------+-------------+---------------+----+--------------------+--------------------+
|2aa897b7-38cd-4af...|2020-10-08-12|@short_read_5/1|   +|GTTACTTCGATATCCGC...|BCCC<EGAGGGGGGGGG...|
+--------------------+-------------+---------------+----+--------------------+--------------------+

What I want instead is one row per read:

+--------------------+-------------+---------------+----+--------------------+--------------------+
|        metadata_key|     datehour|       sequence|desc|        raw_sequence|             quality|
+--------------------+-------------+---------------+----+--------------------+--------------------+
|2aa897b7-38cd-4af...|2020-10-08-12|@short_read_5/1|   +|GTTACTTCGATATCCGC...|BCCC<EGAGGGGGGGGG...|
|2aa897b7-38cd-4af...|2020-10-08-12|@short_read_5/2|   +|dTTACTTCGATATCCGC...|dCCC<EGAGGGGGGGGG...|
+--------------------+-------------+---------------+----+--------------------+--------------------+