Merging struct fields within a struct in PySpark
I have a struct coming from a data source whose struct fields can have several possible data types, like so:
|-- priority: struct (nullable = true)
| |-- priority_a: struct (nullable = true)
| | |-- union: boolean (nullable = true)
| | |-- int32: integer (nullable = true)
| | |-- double: double (nullable = true)
| |-- priority_b: integer (nullable = true)
| |-- priority_c: struct (nullable = true)
| | |-- union: boolean (nullable = true)
| | |-- double: double (nullable = true)
| | |-- int32: integer (nullable = true)
| |-- priority_d: struct (nullable = true)
| | |-- union: boolean (nullable = true)
| | |-- double: double (nullable = true)
| | |-- int32: integer (nullable = true)
| |-- priority_e: double (nullable = true)
I want to merge the struct fields and cast each one to the most sensible data type, for example:
|-- priority: struct (nullable = true)
| |-- priority_a: integer (nullable = true)
| |-- priority_b: integer (nullable = true)
| |-- priority_c: double (nullable = true)
| |-- priority_d: double (nullable = true)
| |-- priority_e: double (nullable = true)
If the column is not a struct field nested inside another struct, the following code does exactly what I need:
try:
    # c is the column name, struct_path the dotted path to the struct,
    # pc the output column name, and t the type I want to cast to
    cols = [f'{c}.{col}' for col in source.select(f'{c}.*').columns]
    if f'{struct_path}.union' in cols:
        cols.remove(f'{struct_path}.union')
    source = source.withColumn(pc, f.coalesce(*cols).cast(t))
except Exception:
    source = source.withColumn(c, f.col(c).cast(t))
For structs whose nested struct fields can themselves have several data types, I would like to do the same thing recursively. Is that possible?
The fields of a StructType can be accessed through its fields attribute, so what you can do is loop over the schema and check whether each field is a StructType:
from pyspark.sql import types as T

for field in schema.fields:
    if isinstance(field.dataType, T.StructType):
        print(field.dataType.fields)
Or, if you want to walk it recursively:
def flatten(schema, prefix=None):
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, T.ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, T.StructType):
            print(dtype)
            fields += flatten(dtype, prefix=name)
        else:
            fields.append((dtype, name))
    return fields