Apache Spark: Is there a way to collect the names of all fields in a nested schema in PySpark?

I want to collect the names of all fields in a nested schema. The data was imported from a JSON file.

The schema looks like this:

root
 |-- column_a: string (nullable = true)
 |-- column_b: string (nullable = true)
 |-- column_c: struct (nullable = true)
 |    |-- nested_a: struct (nullable = true)
 |    |    |-- double_nested_a: string (nullable = true)
 |    |    |-- double_nested_b: string (nullable = true)
 |    |    |-- double_nested_c: string (nullable = true)
 |    |-- nested_b: string (nullable = true)
 |-- column_d: string (nullable = true)

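If I use

df.schema.fields

or

df.schema.names

it only prints the names of the top-level columns, not any of the nested ones.

The output I want is a Python list containing all the column names, for example:

['column_a', 'column_b', 'column_c.nested_a.double_nested_a', 'column_c.nested_a.double_nested_b', etc...]

I could write a custom function, since the information is all there, but am I missing something? Is there an existing method that does what I need?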

By default, Spark has no method that flattens the schema field names for us.

Using the code from the post:

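from pyspark.sql.types import ArrayType, StructType


def flatten(schema, prefix=None):
    """Return the dotted names of all leaf fields in a (possibly nested) schema."""
    fields = []
    for field in schema.fields:
        # Build the dotted path, e.g. 'column_c.nested_a'
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        # For an array column, inspect the type of its elements
        if isinstance(dtype, ArrayType):
            dtype = dtype.elementType

        # Recurse into structs; anything else is a leaf field
        if isinstance(dtype, StructType):
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(name)

    return fields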


df.printSchema()
#root
# |-- column_a: string (nullable = true)
# |-- column_c: struct (nullable = true)
# |    |-- nested_a: struct (nullable = true)
# |    |    |-- double_nested_a: string (nullable = true)
# |    |-- nested_b: string (nullable = true)
# |-- column_d: string (nullable = true)

sch = df.schema

print(flatten(sch))
#['column_a', 'column_c.nested_a.double_nested_a', 'column_c.nested_b', 'column_d']