Flattening a nested struct inside an array in PySpark

Given the following schema:

root
|-- first_name: string
|-- last_name: string
|-- degrees: array
|    |-- element: struct
|    |    |-- school: string
|    |    |-- advisors: struct
|    |    |    |-- advisor1: string
|    |    |    |-- advisor2: string
How can I get a schema like the following instead?

root
|-- first_name: string
|-- last_name: string
|-- degrees: array
|    |-- element: struct
|    |    |-- school: string
|    |    |-- advisor1: string
|    |    |-- advisor2: string

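Currently, I explode the array, flatten the struct by selecting advisors.*, then group by first_name, last_name and rebuild the array with collect_list. I am hoping there is a cleaner/shorter way to do this. As it stands, I run into a lot of trouble renaming some fields and other things I don't want to get into here. Thanks!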

You can use a udf to change the data type of a nested column in a dataframe. Suppose you have read the dataframe as df1:

from pyspark.sql.functions import udf
from pyspark.sql.types import *

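def foo(data):
    # A null array stays null
    if data is None:
        return None
    # Rebuild each element as a flat (school, advisor1, advisor2) tuple
    return [
        (
            x["school"],
            x["advisors"]["advisor1"],
            x["advisors"]["advisor2"],
        )
        for x in data
    ]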

struct = ArrayType(
    StructType([
        StructField("school", StringType()),
        StructField("advisor1", StringType()),
        StructField("advisor2", StringType())
    ])
)
udf_foo = udf(foo, struct)

df2 = df1.withColumn("degrees", udf_foo("degrees"))
df2.printSchema()
Output:

root
 |-- degrees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- school: string (nullable = true)
 |    |    |-- advisor1: string (nullable = true)
 |    |    |-- advisor2: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
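
On Spark 2.4 or later, the same reshaping can probably be done without a Python UDF by combining the SQL higher-order function transform with named_struct. This is only a minimal sketch and not part of the answer above: the helper names flatten_degrees_expr and flatten_degrees are made up for illustration, and the field names are assumed to match the schema in the question.

```python
def flatten_degrees_expr(array_col="degrees"):
    """Build a Spark SQL expression that rebuilds each array element as a flat struct.

    Hypothetical helper; the field names assume the schema from the question.
    """
    return (
        f"transform({array_col}, d -> named_struct("
        "'school', d.school, "
        "'advisor1', d.advisors.advisor1, "
        "'advisor2', d.advisors.advisor2))"
    )


def flatten_degrees(df, array_col="degrees"):
    """Replace the array column with its flattened version (assumes Spark 2.4+)."""
    # Imported lazily so the expression builder can be used without Spark installed
    from pyspark.sql.functions import expr

    return df.withColumn(array_col, expr(flatten_degrees_expr(array_col)))
```

With the dataframe from above, this would be called as df2 = flatten_degrees(df1). Because the work stays inside Spark SQL, rows are not serialized to Python the way they are with a udf.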

Here is a more general solution that can flatten several layers of nested structs:

from pyspark.sql.functions import col

def flatten_df(nested_df, layers):
    # Expand top-level struct columns one level per pass, `layers` passes in total
    flat_df = nested_df
    for _ in range(layers):
        flat_cols = [c[0] for c in flat_df.dtypes if c[1][:6] != 'struct']
        nested_cols = [c[0] for c in flat_df.dtypes if c[1][:6] == 'struct']

        # Keep the flat columns and pull each struct field up as <struct>_<field>
        flat_df = flat_df.select(flat_cols +
                                 [col(nc+'.'+c).alias(nc+'_'+c)
                                  for nc in nested_cols
                                  for c in flat_df.select(nc+'.*').columns])

    return flat_df
Just call it like this:

my_flattened_df = flatten_df(my_df_having_structs, 3)
(The second parameter is the number of layers to flatten, 3 in my case.)
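
If the nesting depth is not known up front, one option is to keep expanding until no top-level struct columns remain. The following is a rough sketch under that assumption; split_columns and flatten_all are names invented here, and the check relies on df.dtypes reporting struct columns with a type string that starts with 'struct'. A column typed array<struct<...>>, like degrees in the question, does not match that prefix, so this approach leaves arrays of structs untouched.

```python
def split_columns(dtypes):
    """Split (name, type_string) pairs, as returned by df.dtypes, into flat and struct column names."""
    flat = [name for name, dtype in dtypes if not dtype.startswith("struct")]
    nested = [name for name, dtype in dtypes if dtype.startswith("struct")]
    return flat, nested


def flatten_all(nested_df):
    """Expand top-level struct columns repeatedly until none are left."""
    # Imported lazily so split_columns can be used without Spark installed
    from pyspark.sql.functions import col

    df = nested_df
    while True:
        flat, nested = split_columns(df.dtypes)
        if not nested:
            return df
        # Keep flat columns and pull each struct field up as <struct>_<field>
        df = df.select(
            flat
            + [
                col(nc + "." + c).alias(nc + "_" + c)
                for nc in nested
                for c in df.select(nc + ".*").columns
            ]
        )
```

It would be called as my_flattened_df = flatten_all(my_df_having_structs), with no layer count to guess.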