Python PySpark: collect a dataframe with nested columns as a dict
I have a dataframe with the following nested schema:
root
|-- data: struct (nullable = true)
| |-- ac_failure: string (nullable = true)
| |-- ac_failure_delayed: string (nullable = true)
| |-- alarm_exit_error: boolean (nullable = true)
| |-- alarm_has_delay: string (nullable = true)
| |-- nodes: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- battery_status: string (nullable = true)
| | | |-- device_id: long (nullable = true)
| | | |-- device_manufacture_id: long (nullable = true)
| | | |-- device_name: string (nullable = true)
| | | |-- device_product_id: long (nullable = true)
| | | |-- device_state: string (nullable = true)
| | | |-- device_status: string (nullable = true)
| | | |-- device_supported_command_class_list: string (nullable = true)
| | | |-- device_type: string (nullable = true)
| | | |-- endpoint_id: long (nullable = true)
| | | |-- partition_id: long (nullable = true)
|-- device_id: long (nullable = true)
|-- device_type: string (nullable = true)
|-- event: string (nullable = true)
|-- event_class: string (nullable = true)
|-- event_timestamp: long (nullable = true)
|-- event_type: string (nullable = true)
|-- imei: string (nullable = true)
|-- partition_id: long (nullable = true)
|-- source: string (nullable = true)
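For anyone who wants to follow along, here is a minimal sketch of a DataFrame with the same kind of nesting. Only a few of the fields above are included, and the sample values are made up, not from the original post:

from pyspark.sql import SparkSession
from pyspark.sql.types import (ArrayType, LongType, StringType,
                               StructField, StructType)

spark = SparkSession.builder.getOrCreate()

# A struct column "data" containing an array of structs, mirroring the schema above.
schema = StructType([
    StructField("data", StructType([
        StructField("alarm_has_delay", StringType()),
        StructField("nodes", ArrayType(StructType([
            StructField("device_id", LongType()),
            StructField("device_name", StringType()),
        ]))),
    ])),
    StructField("device_id", LongType()),
    StructField("event", StringType()),
])

# Struct fields can be supplied as nested tuples when a schema is given.
df2_final = spark.createDataFrame(
    [(("true", [(7, "door sensor")]), 2, "alarm_state")],
    schema,
)
df2_final.printSchema()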
I want to collect each row of this dataframe as a dictionary.

I tried:
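(The post doesn't show the exact attempt; judging from the output below, it was presumably asDict() without recursive=True, something like:)

seq = [row.asDict() for row in df2_final.collect()]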
What I got from it was (a sample of one row):
{'data': Row(ac_failure=None, ac_failure_delayed=None, alarm_exit_error=None, alarm_has_delay='true', nodes=None),
'device_id': 2,
'device_type': 'panel',
'event': 'alarm_state',
'event_class': 'panel_alarm',
'event_timestamp': 1586921122886,
'event_type': 'zone_alarm_perimeter',
'imei': '9900000000000',
'operation': 'report',
'partition_id': 0,
'source': 'panel'}
What can I do to get the nested data as a dict as well? For example:

{'data': {'ac_failure': None, 'ac_failure_delayed': None, 'alarm_exit_error': None, 'alarm_has_delay': 'true', 'nodes': None},
'device_id': 2,
'device_type': 'panel',
'event': 'alarm_state',
'event_class': 'panel_alarm',
'event_timestamp': 1586921122886,
'event_type': 'zone_alarm_perimeter',
'imei': '9900000000000',
'operation': 'report',
'partition_id': 0,
'source': 'panel'}
I want all nested columns as dicts, not pyspark.sql.types.Row. TIA.

Thanks @jxc. Using row.asDict(recursive=True) works:

seq = [row.asDict(recursive=True) for row in df2_final.collect()]
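For what it's worth (my addition, not from the thread): since asDict(recursive=True) yields plain dicts of primitives, the result serializes directly, which is a quick way to eyeball that the nesting came out as dicts:

import json

seq = [row.asDict(recursive=True) for row in df2_final.collect()]
print(json.dumps(seq[0], indent=2))  # nested structs print as JSON objects, not Row(...) reprs

An alternative that avoids Row objects entirely is DataFrame.toJSON(), though note it omits fields whose value is null:

import json

seq = [json.loads(s) for s in df2_final.toJSON().collect()]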