PySpark: parsing JSON with identical keys into separate columns


pyspark


My data follows a nested JSON schema. When the field Reports_Rows_Rows_Cells is not null, it looks like this:

[Row(Attributes=[Row(Id='account', Value='bd9e85e0-0478-433d-ae9f-0b3c4f04bfe4')], Value='Business Bank Account'),
 Row(Attributes=[Row(Id='account', Value='bd9e85e0-0478-433d-ae9f-0b3c4f04bfe4')], Value='10105.54'),
 Row(Attributes=[Row(Id='account', Value='bd9e85e0-0478-433d-ae9f-0b3c4f04bfe4')], Value='4938.48')]


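What I want is a table that contains all of the columns above, with the column Reports_Rows_Rows_Cells_Value spread across one row like this: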

-------- |-------| Reports_Rows_Rows_Cells_Value | Reports_Rows_Rows_Cells_Value | Reports_Rows_Rows_Cells_Value |
         |       | Business Bank Account         | 10105.54                      | 4938.48                       |

After parsing the JSON, my table looks like this instead, with one row per cell:


-------- |-------| Reports_Rows_Rows_Cells_Value |
         |       | Business Bank Account         |
         |       | 10105.54                      |
         |       | 4938.48                       |

This is the code I use to parse the JSON:

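from pyspark.sql import functions as F
from pyspark.sql.functions import explode_outer


def flatten_df(nested_df):
    # Explode every array column so that each element gets its own row.
    array_cols = [c[0] for c in nested_df.dtypes if c[1].startswith('array')]
    for col in array_cols:
        nested_df = nested_df.withColumn(col, explode_outer(nested_df[col]))
    # Stop once no struct columns remain.
    nested_cols = [c[0] for c in nested_df.dtypes if c[1].startswith('struct')]
    if len(nested_cols) == 0:
        return nested_df
    # Keep the flat columns and promote each struct field to a top-level
    # column named <struct>_<field>, then recurse to handle deeper nesting.
    flat_cols = [c[0] for c in nested_df.dtypes if not c[1].startswith('struct')]
    flat_df = nested_df.select(flat_cols +
                               [F.col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    return flatten_df(flat_df)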

Can you show your code?

Steven, I changed the question and added the code. Can you see it?

You can unpack the array directly, as shown:
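One possible reading of the suggestion to unpack the array directly is to index into the cells array before it is exploded, so that each element becomes its own column rather than its own row. The following is a minimal sketch of that idea, not the commenter's code. It assumes a DataFrame that still holds an array column named Reports_Rows_Rows_Cells whose elements are structs with a Value field, and it assumes a fixed, known number of cells (3 here). The helper name cell_value_exprs and the numeric column suffixes are hypothetical. The suffixes are added because three columns with the same name would be ambiguous to select later.

```python
def cell_value_exprs(array_col, field, n_cells):
    """Build Spark SQL expressions that pull `field` out of the first
    `n_cells` elements of `array_col`, one output column per element.

    Hypothetical helper: it only builds strings and does not touch Spark.
    """
    return [
        f"{array_col}[{i}].{field} AS {array_col}_{field}_{i + 1}"
        for i in range(n_cells)
    ]


# Assumption: every row has (at most) three cells.
exprs = cell_value_exprs("Reports_Rows_Rows_Cells", "Value", 3)

# With a DataFrame `df` whose cells array has NOT been exploded yet,
# the expressions can be passed to DataFrame.selectExpr:
#
#   wide_df = df.selectExpr("*", *exprs).drop("Reports_Rows_Rows_Cells")
```

For this to work with flatten_df, the cells array would have to be excluded from the explode step, since explode_outer is what turns each cell into a separate row. Whether an out-of-range index yields null or raises an error depends on the Spark version and its ANSI mode setting, so rows with fewer cells than expected may need separate handling.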