Python PySpark: collect a dataframe with nested columns as a dict
I have a dataframe with the following nested schema:
root
|-- data: struct (nullable = true)
| |-- ac_failure: string (nullable = true)
| |-- ac_failure_delayed: string (nullable = true)
| |-- alarm_exit_error: boolean (nullable = true)
| |-- alarm_has_delay: string (nullable = true)
| |-- nodes: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- battery_status: string (nullable = true)
| | | |-- device_id: long (nullable = true)
| | | |-- device_manufacture_id: long (nullable = true)
| | | |-- device_name: string (nullable = true)
| | | |-- device_product_id: long (nullable = true)
| | | |-- device_state: string (nullable = true)
| | | |-- device_status: string (nullable = true)
| | | |-- device_supported_command_class_list: string (nullable = true)
| | | |-- device_type: string (nullable = true)
| | | |-- endpoint_id: long (nullable = true)
| | | |-- partition_id: long (nullable = true)
|-- device_id: long (nullable = true)
|-- device_type: string (nullable = true)
|-- event: string (nullable = true)
|-- event_class: string (nullable = true)
|-- event_timestamp: long (nullable = true)
|-- event_type: string (nullable = true)
|-- imei: string (nullable = true)
|-- partition_id: long (nullable = true)
|-- source: string (nullable = true)
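For anyone who wants to follow along, here is a minimal sketch of a DataFrame with the same kind of nesting. Only a few of the fields above are included, and the sample values are made up, not from the original post:

from pyspark.sql import SparkSession
from pyspark.sql.types import (ArrayType, LongType, StringType,
                               StructField, StructType)

spark = SparkSession.builder.getOrCreate()

# A struct column "data" containing an array of structs, mirroring the schema above.
schema = StructType([
    StructField("data", StructType([
        StructField("alarm_has_delay", StringType()),
        StructField("nodes", ArrayType(StructType([
            StructField("device_id", LongType()),
            StructField("device_name", StringType()),
        ]))),
    ])),
    StructField("device_id", LongType()),
    StructField("event", StringType()),
])

# Struct fields can be supplied as nested tuples when a schema is given.
df2_final = spark.createDataFrame(
    [(("true", [(7, "door sensor")]), 2, "alarm_state")],
    schema,
)
df2_final.printSchema()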
I want to collect each row of this dataframe as a dictionary.

I tried:
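(The post doesn't show the exact attempt; judging from the output below, it was presumably asDict() without recursive=True, something like:)

seq = [row.asDict() for row in df2_final.collect()]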
What I got from it was (a sample of one row):
{'data': Row(ac_failure=None, ac_failure_delayed=None, alarm_exit_error=None, alarm_has_delay='true', nodes=None),
'device_id': 2,
'device_type': 'panel',
'event': 'alarm_state',
'event_class': 'panel_alarm',
'event_timestamp': 1586921122886,
'event_type': 'zone_alarm_perimeter',
'imei': '9900000000000',
'operation': 'report',
'partition_id': 0,
'source': 'panel'}
What can I do to get the nested data as a dict as well? For example:

{'data': {'ac_failure': None, 'ac_failure_delayed': None, 'alarm_exit_error': None, 'alarm_has_delay': 'true', 'nodes': None},
'device_id': 2,
'device_type': 'panel',
'event': 'alarm_state',
'event_class': 'panel_alarm',
'event_timestamp': 1586921122886,
'event_type': 'zone_alarm_perimeter',
'imei': '9900000000000',
'operation': 'report',
'partition_id': 0,
'source': 'panel'}
I want all nested columns as dicts, not pyspark.sql.types.Row. TIA.

Thanks @jxc. Using row.asDict(recursive=True) works:

seq = [row.asDict(recursive=True) for row in df2_final.collect()]
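For what it's worth (my addition, not from the thread): since asDict(recursive=True) yields plain dicts of primitives, the result serializes directly, which is a quick way to eyeball that the nesting came out as dicts:

import json

seq = [row.asDict(recursive=True) for row in df2_final.collect()]
print(json.dumps(seq[0], indent=2))  # nested structs print as JSON objects, not Row(...) reprs

An alternative that avoids Row objects entirely is DataFrame.toJSON(), though note it omits fields whose value is null:

import json

seq = [json.loads(s) for s in df2_final.toJSON().collect()]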