PySpark: merging struct fields inside a struct


I have a struct from a data source in which the struct fields can have several possible data types, as shown below:

 |-- priority: struct (nullable = true)
 |    |-- priority_a: struct (nullable = true)
 |    |    |-- union: boolean (nullable = true)
 |    |    |-- int32: integer (nullable = true)
 |    |    |-- double: double (nullable = true)
 |    |-- priority_b: integer (nullable = true)
 |    |-- priority_c: struct (nullable = true)
 |    |    |-- union: boolean (nullable = true)
 |    |    |-- double: double (nullable = true)
 |    |    |-- int32: integer (nullable = true)
 |    |-- priority_d: struct (nullable = true)
 |    |    |-- union: boolean (nullable = true)
 |    |    |-- double: double (nullable = true)
 |    |    |-- int32: integer (nullable = true)
 |    |-- priority_e: double (nullable = true)

I want to merge the struct fields and cast each one to the most meaningful data type, for example:

 |-- priority: struct (nullable = true)
 |    |-- priority_a: integer (nullable = true)
 |    |-- priority_b: integer (nullable = true)
 |    |-- priority_c: double (nullable = true)
 |    |-- priority_d: double (nullable = true)
 |    |-- priority_e: double (nullable = true)

If the column is not a struct nested inside another struct, the following code does exactly what I need:

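# source: the DataFrame, c: the column name, t: the type to cast to,
# f: pyspark.sql.functions (import pyspark.sql.functions as f)
try:
    # Expand the struct and build the qualified names of its members.
    cols = [f'{c}.{col}' for col in source.select(f'{c}.*').columns]
    if f'{c}.union' in cols:
        cols.remove(f'{c}.union')
    # Take the first non-null member and cast it to the target type.
    source = source.withColumn(c, f.coalesce(*cols).cast(t))
except Exception:
    # c is not a struct, so expanding it with '.*' fails; cast it directly.
    source = source.withColumn(c, f.col(c).cast(t))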

For a struct whose nested struct fields can have several data types, I would like to do the same thing recursively. Is that possible?

The fields of a StructType can be accessed through its fields attribute, so you can loop over the schema and check whether each field's data type is a StructType:

from pyspark.sql import types as T

# schema is the DataFrame's schema, e.g. source.schema
for field in schema.fields:
    if isinstance(field.dataType, T.StructType):
        print(field.dataType.fields)

Or, if you want to walk the schema recursively:

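from pyspark.sql import types as T

def flatten(schema, prefix=None):
    # Returns a list of (data type, dotted name) pairs for every leaf field.
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        # Look through arrays at their element type.
        if isinstance(dtype, T.ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, T.StructType):
            print(dtype)  # debug output: the nested struct being descended into
            fields += flatten(dtype, prefix=name)
        else:
            fields.append((dtype, name))
    return fields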
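
To go from walking the schema to actually rebuilding the column, one possible approach is to generate SQL expressions from the schema and pass them to selectExpr. The sketch below is only an illustration under several assumptions: it works on the plain dict that source.schema.jsonValue() returns (so the logic can be tried without a Spark session), it treats any struct that has a member named union plus only scalar members as one of the multi-type fields, and, when no target type is given, it casts to the type of the first non-union member. The names is_union_struct, merge_expr, select_exprs and targets are made up for this example, and field names are assumed not to contain backticks or quotes.

```python
UNION_FLAG = "union"  # marker member name, taken from the schema in the question


def is_union_struct(dtype):
    """True for a struct holding the marker plus only scalar alternatives."""
    if not (isinstance(dtype, dict) and dtype.get("type") == "struct"):
        return False
    names = [fld["name"] for fld in dtype["fields"]]
    others = [fld for fld in dtype["fields"] if fld["name"] != UNION_FLAG]
    return (UNION_FLAG in names and bool(others)
            and all(isinstance(fld["type"], str) for fld in others))


def merge_expr(dtype, path, targets=None):
    """Build a Spark SQL expression for the field at `path` (list of names)."""
    targets = targets or {}
    quoted = ".".join(f"`{part}`" for part in path)
    if is_union_struct(dtype):
        members = [fld for fld in dtype["fields"] if fld["name"] != UNION_FLAG]
        # Assumption: default to the first alternative's type unless overridden.
        target = targets.get(".".join(path), members[0]["type"])
        alternatives = ", ".join(f"{quoted}.`{m['name']}`" for m in members)
        return f"CAST(COALESCE({alternatives}) AS {target})"
    if isinstance(dtype, dict) and dtype.get("type") == "struct":
        # Ordinary struct: rebuild it member by member, recursing as needed.
        parts = ", ".join(
            f"'{fld['name']}', "
            f"{merge_expr(fld['type'], path + [fld['name']], targets)}"
            for fld in dtype["fields"]
        )
        return f"named_struct({parts})"
    # Scalars (and arrays, which this sketch leaves untouched) pass through.
    return quoted


def select_exprs(schema_json, targets=None):
    """One 'expr AS name' string per top-level field, for selectExpr."""
    return [
        f"{merge_expr(fld['type'], [fld['name']], targets)} AS `{fld['name']}`"
        for fld in schema_json["fields"]
    ]


def _field(name, dtype):
    # Same shape as one entry of StructType.jsonValue()["fields"].
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}


def _struct(*fields):
    return {"type": "struct", "fields": list(fields)}


# A cut-down version of the schema from the question.
sample = _struct(
    _field("priority", _struct(
        _field("priority_a", _struct(
            _field("union", "boolean"),
            _field("int32", "integer"),
            _field("double", "double"),
        )),
        _field("priority_b", "integer"),
    ))
)

for expr in select_exprs(sample):
    print(expr)

# With a real DataFrame this would be used roughly as:
# source = source.selectExpr(*select_exprs(source.schema.jsonValue()))
```

To force a particular type, pass a mapping keyed by dotted path, for example select_exprs(sample, {"priority.priority_a": "double"}). Note that rebuilding with named_struct yields a non-null struct even for rows where the original struct was null, so wrap the expression in a null check if that distinction matters.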