Flattening a complex JSON schema with PySpark
I am trying to flatten a complex JSON structure that contains nested arrays and struct elements, using a generic function that should work for a JSON file with any schema. Below is part of the sample JSON structure I want to flatten:
root
|-- Data: struct (nullable = true)
| |-- Record: struct (nullable = true)
| | |-- FName: string (nullable = true)
| | |-- LName: long (nullable = true)
| | |-- Address: struct (nullable = true)
| | | |-- Applicant: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- Id: long (nullable = true)
| | | | | |-- Type: string (nullable = true)
| | | | | |-- Option: long (nullable = true)
| | | |-- Location: string (nullable = true)
| | | |-- Town: long (nullable = true)
| | |-- IsActive: boolean (nullable = true)
|-- Id: string (nullable = true)
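The goal is a schema with no nested structs or arrays.
I am using the code suggested in another thread, but it does not work for array elements. With that code I get the output below. How can I modify the code so that arrays are flattened as well?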
root
|-- Data_Record_FName: string (nullable = true)
|-- Data_Record_LName: long (nullable = true)
|-- Data_Record_Address_Applicant: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Id: long (nullable = true)
| | |-- Type: string (nullable = true)
| | |-- Option: long (nullable = true)
|-- Data_Record_Address_Location: string (nullable = true)
|-- Data_Record_Address_Town: long (nullable = true)
|-- Data_Record_IsActive: boolean (nullable = true)
|-- Id: string (nullable = true)
This is not a duplicate, because there is no existing post with a generic function for flattening a complex JSON schema that contains arrays.
Use explode on the array, then run the result through your code again.
Yes, I tried modifying the code above as follows, but it did not work as expected. Can you help?
array_cols.append([c[0] for c in flat_df[i-1].dtypes if c[1][:5] == 'array'])
flat_df.append(flat_df[i-1].select(... + [col(nc + '.' + c).alias(nc + '_' + c) for nc in array_cols[i] for c in flat_df[i-1].select(explode(nc + ...)).columns]))