Apache Spark / PySpark: flattening records from an input file
I have an input CSV file like the following:
plant_id, system1_id, system2_id, system3_id
A1 s1-111 s2-111 s3-111
A2 s1-222 s2-222 s3-222
A3 s1-333 s2-333 s3-333
I want to flatten the records as shown below:
plant_id system_id system_name
A1 s1-111 system1
A1 s2-111 system2
A1 s3-111 system3
A2 s1-222 system1
A2 s2-222 system2
A2 s3-222 system3
A3 s1-333 system1
A3 s2-333 system2
A3 s3-333 system3
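Currently I can achieve this by creating a transposed PySpark DataFrame for each system column and then unioning all of the DataFrames at the end, but that requires writing a lot of code. Is there a way to do it in a few lines?

Use stack. The string literals in the expression become the system_name values, so replace them with 'system1', 'system2', 'system3' if you want exactly the labels requested in the question: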
df2 = df.selectExpr(
    'plant_id',
    """stack(
        3,
        system1_id, 'system1_id', system2_id, 'system2_id', system3_id, 'system3_id')
        as (system_id, system_name)"""
)
df2.show()
+--------+---------+-----------+
|plant_id|system_id|system_name|
+--------+---------+-----------+
|      A1|   s1-111| system1_id|
|      A1|   s2-111| system2_id|
|      A1|   s3-111| system3_id|
|      A2|   s1-222| system1_id|
|      A2|   s2-222| system2_id|
|      A2|   s3-222| system3_id|
|      A3|   s1-333| system1_id|
|      A3|   s2-333| system2_id|
|      A3|   s3-333| system3_id|
+--------+---------+-----------+
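If there are many system columns, writing the stack arguments by hand gets tedious. The following is a minimal sketch, not part of the original answer, that builds the same expression from a list of column names. The helper name build_stack_expr and the rule of stripping a trailing "_id" to get the label are assumptions made for illustration, and the column names are assumed to be plain identifiers that need no quoting.

```python
def build_stack_expr(value_cols, suffix="_id"):
    """Build a Spark SQL stack() expression that unpivots value_cols.

    Each column contributes a (value, label) pair. The label is the column
    name with the trailing suffix removed, e.g. 'system1_id' -> 'system1'.
    """
    pairs = []
    for col in value_cols:
        label = col[:-len(suffix)] if suffix and col.endswith(suffix) else col
        pairs.append(f"{col}, '{label}'")
    return (
        f"stack({len(value_cols)}, {', '.join(pairs)}) "
        "as (system_id, system_name)"
    )


value_cols = ["system1_id", "system2_id", "system3_id"]
stack_expr = build_stack_expr(value_cols)
print(stack_expr)

# With a live SparkSession and the DataFrame `df` from the answer above,
# the expression would be used like this:
# df2 = df.selectExpr("plant_id", stack_expr)
```

Because the labels are derived from the column names, this variant yields system1, system2, system3 in system_name, which matches the output requested in the question.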
1. Prepare the sample input data
from pyspark.sql import functions as F
sampleData = (('A1','s1-111','s2-111','s3-111'),
('A2','s1-222','s2-222','s3-222'),
('A3','s1-333','s2-222','s3-333')
)
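2. Create the list of input column names
columns = ['plant_id', 'system1_id', 'system2_id', 'system3_id']
3. Create the Spark DataFrame
# `spark` is the active SparkSession (available by default in the pyspark shell)
df = spark.createDataFrame(data=sampleData, schema=columns)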
df.show()
+--------+----------+----------+----------+
|plant_id|system1_id|system2_id|system3_id|
+--------+----------+----------+----------+
|      A1|    s1-111|    s2-111|    s3-111|
|      A2|    s1-222|    s2-222|    s3-222|
|      A3|    s1-333|    s2-222|    s3-333|
+--------+----------+----------+----------+
4. Use the stack() function to turn multiple columns into rows. Its syntax is stack(n, expr1, ..., exprk): it separates expr1, ..., exprk into n rows.
finalDF = df.select('plant_id',F.expr("stack(3,system1_id, 'system1_id', system2_id, 'system2_id', system3_id, 'system3_id') as (system_id, system_name)"))
finalDF.show()
+--------+---------+-----------+
|plant_id|system_id|system_name|
+--------+---------+-----------+
|      A1|   s1-111| system1_id|
|      A1|   s2-111| system2_id|
|      A1|   s3-111| system3_id|
|      A2|   s1-222| system1_id|
|      A2|   s2-222| system2_id|
|      A2|   s3-222| system3_id|
|      A3|   s1-333| system1_id|
|      A3|   s2-222| system2_id|
|      A3|   s3-333| system3_id|
+--------+---------+-----------+
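To make the behaviour of stack(n, expr1, ..., exprk) concrete, here is a rough pure-Python emulation for a single input row. It is only an illustration of the semantics, not Spark code; the function name stack_rows is hypothetical. The k values are laid out row by row into n rows, and cells that are left over are filled with None (NULL in Spark).

```python
def stack_rows(n, *exprs):
    """Emulate Spark SQL stack(n, expr1, ..., exprk) for one input row.

    The k values are split into n output rows of ceil(k / n) columns each;
    trailing cells with no value are filled with None.
    """
    width = -(-len(exprs) // n)  # ceiling division
    padded = list(exprs) + [None] * (n * width - len(exprs))
    return [tuple(padded[i * width:(i + 1) * width]) for i in range(n)]


# One input row from the sample data, as a plain dict.
row = {"plant_id": "A1", "system1_id": "s1-111",
       "system2_id": "s2-111", "system3_id": "s3-111"}

# Six arguments split into three rows gives (system_id, system_name) pairs.
pairs = stack_rows(
    3,
    row["system1_id"], "system1_id",
    row["system2_id"], "system2_id",
    row["system3_id"], "system3_id",
)

# Selecting plant_id next to stack() repeats it on every generated row.
flattened = [(row["plant_id"],) + pair for pair in pairs]
for record in flattened:
    print(record)
```

Each (value, label) pair becomes one output row, which is why the arguments alternate between a column reference and a string literal.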