Merging struct fields within a struct in PySpark
I have a struct coming from a data source whose struct fields can have several possible data types, like so:
|-- priority: struct (nullable = true)
| |-- priority_a: struct (nullable = true)
| | |-- union: boolean (nullable = true)
| | |-- int32: integer (nullable = true)
| | |-- double: double (nullable = true)
| |-- priority_b: integer (nullable = true)
| |-- priority_c: struct (nullable = true)
| | |-- union: boolean (nullable = true)
| | |-- double: double (nullable = true)
| | |-- int32: integer (nullable = true)
| |-- priority_d: struct (nullable = true)
| | |-- union: boolean (nullable = true)
| | |-- double: double (nullable = true)
| | |-- int32: integer (nullable = true)
| |-- priority_e: double (nullable = true)
I want to merge the struct fields and cast each one to the most sensible data type, for example:
|-- priority: struct (nullable = true)
| |-- priority_a: integer (nullable = true)
| |-- priority_b: integer (nullable = true)
| |-- priority_c: double (nullable = true)
| |-- priority_d: double (nullable = true)
| |-- priority_e: double (nullable = true)
If the column is not a struct field nested inside another struct, the following code does exactly what I need:
try:
    # c is the column name, struct_path the dotted path to the struct,
    # pc the output column name, and t the type I want to cast to
    cols = [f'{c}.{col}' for col in source.select(f'{c}.*').columns]
    if f'{struct_path}.union' in cols:
        cols.remove(f'{struct_path}.union')
    source = source.withColumn(pc, f.coalesce(*cols).cast(t))
except Exception:
    source = source.withColumn(c, f.col(c).cast(t))
For structs whose nested struct fields can themselves have several data types, I would like to do the same thing recursively. Is that possible?
The fields of a StructType can be accessed through its fields attribute, so what you can do is loop over the schema and check whether each field is a StructType:
from pyspark.sql import types as T

for field in schema.fields:
    if isinstance(field.dataType, T.StructType):
        print(field.dataType.fields)
Or, if you want to walk it recursively:
def flatten(schema, prefix=None):
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, T.ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, T.StructType):
            print(dtype)
            fields += flatten(dtype, prefix=name)
        else:
            fields.append((dtype, name))
    return fields