Apache Spark PySpark: flattening records from an input file


I have an input CSV file like this:

plant_id,  system1_id, system2_id, system3_id
A1          s1-111      s2-111     s3-111
A2          s1-222      s2-222     s3-222
A3          s1-333      s2-333     s3-333
I want to flatten the records like this:

plant_id    system_id     system_name   
A1          s1-111        system1
A1          s2-111        system2
A1          s3-111        system3
A2          s1-222        system1
A2          s2-222        system2
A2          s3-222        system3
A3          s1-333        system1
A3          s2-333        system2
A3          s3-333        system3

Currently I can achieve this by creating one transposed PySpark DataFrame per system column and then unioning them all at the end, but that takes a long piece of code. Is there a way to do it in just a few lines?
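
For reference, that union-based approach looks roughly like this (a minimal sketch, assuming df already holds the input above):

from functools import reduce
from pyspark.sql import functions as F

# One narrow frame per system column, then union them all
parts = [
    df.select(
        'plant_id',
        F.col('system{}_id'.format(i)).alias('system_id'),
        F.lit('system{}'.format(i)).alias('system_name'),
    )
    for i in (1, 2, 3)
]
flat = reduce(lambda a, b: a.unionByName(b), parts)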

Use stack:

df2 = df.selectExpr(
    'plant_id',
    """stack(
         3,
         system1_id, 'system1', system2_id, 'system2', system3_id, 'system3')
         as (system_id, system_name)"""
)

df2.show()
+--------+---------+-----------+
|plant_id|system_id|system_name|
+--------+---------+-----------+
|      A1|   s1-111|    system1|
|      A1|   s2-111|    system2|
|      A1|   s3-111|    system3|
|      A2|   s1-222|    system1|
|      A2|   s2-222|    system2|
|      A2|   s3-222|    system3|
|      A3|   s1-333|    system1|
|      A3|   s2-333|    system2|
|      A3|   s3-333|    system3|
+--------+---------+-----------+
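
On Spark 3.4+ there is also a built-in DataFrame.unpivot (aliased as melt) that does the same thing without a SQL expression. Note that it emits the raw column names (system1_id, ...) in the variable column, so trim the suffix if you want the shorter labels:

from pyspark.sql import functions as F

# Spark 3.4+ only: unpivot/melt is the built-in equivalent of stack
df2 = df.unpivot(
    ids=['plant_id'],
    values=['system1_id', 'system2_id', 'system3_id'],
    variableColumnName='system_name',
    valueColumnName='system_id',
)
# system_name holds the original column names; strip the '_id' suffix
df2 = df2.withColumn('system_name', F.regexp_replace('system_name', '_id$', ''))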

1. Prepare sample input data

from pyspark.sql import functions as F
sampleData = (('A1','s1-111','s2-111','s3-111'),
        ('A2','s1-222','s2-222','s3-222'),
        ('A3','s1-333','s2-333','s3-333')
        )
2. Create the list of input column names
columns = ['plant_id', 'system1_id', 'system2_id', 'system3_id']

3. Create the Spark DataFrame

df = spark.createDataFrame(data=sampleData, schema=columns)
df.show()
+--------+----------+----------+----------+
|plant_id|system1_id|system2_id|system3_id|
+--------+----------+----------+----------+
|      A1|    s1-111|    s2-111|    s3-111|
|      A2|    s1-222|    s2-222|    s3-222|
|      A3|    s1-333|    s2-333|    s3-333|
+--------+----------+----------+----------+
4. Use the stack() function to split the multiple system columns into rows. The syntax is:
stack(n, expr1, ..., exprk) - separates expr1, ..., exprk into n rows.
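
As a quick illustration of that syntax on its own, independent of the example data:

spark.sql("SELECT stack(2, 1, 'a', 2, 'b') AS (num, letter)").show()
# +---+------+
# |num|letter|
# +---+------+
# |  1|     a|
# |  2|     b|
# +---+------+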

finalDF = df.select(
    'plant_id',
    F.expr("stack(3, system1_id, 'system1', system2_id, 'system2', system3_id, 'system3') as (system_id, system_name)")
)

finalDF.show()
+--------+---------+-----------+
|plant_id|system_id|system_name|
+--------+---------+-----------+
|      A1|   s1-111|    system1|
|      A1|   s2-111|    system2|
|      A1|   s3-111|    system3|
|      A2|   s1-222|    system1|
|      A2|   s2-222|    system2|
|      A2|   s3-222|    system3|
|      A3|   s1-333|    system1|
|      A3|   s2-333|    system2|
|      A3|   s3-333|    system3|
+--------+---------+-----------+
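
If the number of system columns grows, the stack expression can also be built dynamically from the column names instead of being hard-coded (a sketch, assuming the same df as above):

value_cols = [c for c in df.columns if c != 'plant_id']
# Build the "col1, 'label1', col2, 'label2', ..." argument list
pairs = ', '.join("{0}, '{1}'".format(c, c.replace('_id', '')) for c in value_cols)
finalDF = df.select(
    'plant_id',
    F.expr("stack({0}, {1}) as (system_id, system_name)".format(len(value_cols), pairs))
)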