PySpark: pivot all rows

I have a DataFrame that I want to pivot, and I want to keep all rows, not just the max, min, or first value per group:
df = spark.createDataFrame(
    [
        ('sequence', '@short_read_5/1', '2aa897b7-38cd-4af...', '2020-10-08-12'),
        ('desc', '+', '2aa897b7-38cd-4af...', '2020-10-08-12'),
        ('raw_sequence', 'GTTACTTCGATATCCGC...', '2aa897b7-38cd-4af...', '2020-10-08-12'),
        ('quality', 'BCCC<EGAGGGGGGGGG...', '2aa897b7-38cd-4af...', '2020-10-08-12'),
        ('sequence', '@short_read_5/2', '2aa897b7-38cd-4af...', '2020-10-08-12'),
        ('desc', '+', '2aa897b7-38cd-4af...', '2020-10-08-12'),
        ('raw_sequence', 'dTTACTTCGATATCCGC...', '2aa897b7-38cd-4af...', '2020-10-08-12'),
        ('quality', 'dCCC<EGAGGGGGGGGG...', '2aa897b7-38cd-4af...', '2020-10-08-12'),
    ],
    ['col', 'data', 'metadata_key', 'datehour']
)
+------------+--------------------+--------------------+-------------+
| col| data| metadata_key| datehour|
+------------+--------------------+--------------------+-------------+
| sequence| @short_read_5/1|2aa897b7-38cd-4af...|2020-10-08-12|
| desc| +|2aa897b7-38cd-4af...|2020-10-08-12|
|raw_sequence|GTTACTTCGATATCCGC...|2aa897b7-38cd-4af...|2020-10-08-12|
| quality|BCCC<EGAGGGGGGGGG...|2aa897b7-38cd-4af...|2020-10-08-12|
| sequence| @short_read_5/2|2aa897b7-38cd-4af...|2020-10-08-12|
| desc| +|2aa897b7-38cd-4af...|2020-10-08-12|
|raw_sequence|dTTACTTCGATATCCGC...|2aa897b7-38cd-4af...|2020-10-08-12|
| quality|dCCC<EGAGGGGGGGGG...|2aa897b7-38cd-4af...|2020-10-08-12|
+------------+--------------------+--------------------+-------------+
Comment: Why is dTTACTTCGATATCCGC... on the same row as @short_read_5/2? What is the logic that pairs them? If you can't answer that simple question, that's why you end up with only one row.
Pivoting with F.first collapses each group to a single row:

import pyspark.sql.functions as F

df.groupBy('metadata_key', 'datehour') \
    .pivot("col", ["sequence", "desc", "raw_sequence", "quality"]) \
    .agg(F.first(F.col("data"))) \
    .show()
+--------------------+-------------+---------------+----+--------------------+--------------------+
| metadata_key| datehour| sequence|desc| raw_sequence| quality|
+--------------------+-------------+---------------+----+--------------------+--------------------+
|2aa897b7-38cd-4af...|2020-10-08-12|@short_read_5/1| +|GTTACTTCGATATCCGC...|BCCC<EGAGGGGGGGGG...|
+--------------------+-------------+---------------+----+--------------------+--------------------+
What I want instead is one output row per read:

+--------------------+-------------+---------------+----+--------------------+--------------------+
|        metadata_key|     datehour|       sequence|desc|        raw_sequence|             quality|
+--------------------+-------------+---------------+----+--------------------+--------------------+
|2aa897b7-38cd-4af...|2020-10-08-12|@short_read_5/1|   +|GTTACTTCGATATCCGC...|BCCC<EGAGGGGGGGGG...|
|2aa897b7-38cd-4af...|2020-10-08-12|@short_read_5/2|   +|dTTACTTCGATATCCGC...|dCCC<EGAGGGGGGGGG...|
+--------------------+-------------+---------------+----+--------------------+--------------------+