Converting nested JSON to a DataFrame in PySpark

I am trying to create a DataFrame from JSON that contains nested fields and date fields, and I would like to concatenate the parts of each date:

root
 |-- MODEL: string (nullable = true)
 |-- CODE: string (nullable = true)
 |-- START_Time: struct (nullable = true)
 |    |-- day: string (nullable = true)
 |    |-- hour: string (nullable = true)
 |    |-- minute: string (nullable = true)
 |    |-- month: string (nullable = true)
 |    |-- second: string (nullable = true)
 |    |-- year: string (nullable = true)
 |-- WEIGHT: string (nullable = true)
 |-- REGISTED: struct (nullable = true)
 |    |-- day: string (nullable = true)
 |    |-- hour: string (nullable = true)
 |    |-- minute: string (nullable = true)
 |    |-- month: string (nullable = true)
 |    |-- second: string (nullable = true)
 |    |-- year: string (nullable = true)
 |-- TOTAL: string (nullable = true)
 |-- SCHEDULED: struct (nullable = true)
 |    |-- day: long (nullable = true)
 |    |-- hour: long (nullable = true)
 |    |-- minute: long (nullable = true)
 |    |-- month: long (nullable = true)
 |    |-- second: long (nullable = true)
 |    |-- year: long (nullable = true)
 |-- PACKAGE: string (nullable = true)
The goal is to get a result more like the following:

+---------+------------------+----------+-----------------+---------+-----------------+
|MODEL    |   START_Time     | WEIGHT   |REGISTED         |TOTAL    |SCHEDULED        |   
+---------+------------------+----------+-----------------+---------+-----------------+
|.........| yy-mm-dd-hh-mm-ss| WEIGHT   |yy-mm-dd-hh-mm-ss|TOTAL    |yy-mm-dd-hh-mm-ss| 
where yy-mm-dd-hh-mm-ss is built from the day, hour, minute, ... fields of the corresponding struct in the JSON:

|-- example: struct (nullable = true)
 |    |-- day: string (nullable = true)
 |    |-- hour: string (nullable = true)
 |    |-- minute: string (nullable = true)
 |    |-- month: string (nullable = true)
 |    |-- second: string (nullable = true)
 |    |-- year: string (nullable = true)
I tried the explode function, possibly not using it the way it should be used, but it did not work. Can anyone suggest a solution?
Thank you.

You can do this in a few simple steps:

  • Suppose the data.json file contains the following data

    {"MODEL":"abc","CODE":"code1","START_Time":{"day":"05","hour":"08","minute":"30","month":"08","second":"30","year":"21"},"WEIGHT":"231","REGISTED":{"day":"05","hour":"08","minute":"30","month":"08","second":"30","year":"21"},"TOTAL":"1","SCHEDULED":{"day":"05","hour":"08","minute":"30","month":"08","second":"30","year":"21"},"PACKAGE":"car"}

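    This data has the same structure as the schema you shared (except that the SCHEDULED fields are strings here rather than longs)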

  • Read this JSON file in PySpark as follows

    from pyspark.sql.functions import *
    
    df = spark.read.json('data.json')
    
    
  • Now you can read the nested values and overwrite the struct columns as follows

    from pyspark.sql.functions import col, concat, lit

    # `spark` is the active SparkSession (predefined in the pyspark shell).
    df = spark.read.json('data.json')

    def joined(name):
        # Build 'year-month-day-hour-minute-second' from the struct column `name`.
        return concat(
            col(name + '.year'), lit('-'), col(name + '.month'), lit('-'),
            col(name + '.day'), lit('-'), col(name + '.hour'), lit('-'),
            col(name + '.minute'), lit('-'), col(name + '.second'))

    # Overwrite each struct column with its joined string and display the result.
    (df.withColumn('START_Time', joined('START_Time'))
       .withColumn('REGISTED', joined('REGISTED'))
       .withColumn('SCHEDULED', joined('SCHEDULED'))
       .show())
    
  • The output will be

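    +-----+-----+-------+-----------------+-----------------+-----------------+-----+------+
    | CODE|MODEL|PACKAGE|         REGISTED|        SCHEDULED|       START_Time|TOTAL|WEIGHT|
    +-----+-----+-------+-----------------+-----------------+-----------------+-----+------+
    |code1|  abc|    car|21-08-05-08-30-30|21-08-05-08-30-30|21-08-05-08-30-30|    1|   231|
    +-----+-----+-------+-----------------+-----------------+-----------------+-----+------+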
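
The same joining step can probably be written more compactly. The sketch below is not the original answer's code: it swaps `concat` plus `lit('-')` for `concat_ws`, and the names `DATE_PARTS`, `part_paths` and `join_date_structs` are made up for illustration. One behavioural difference to keep in mind: `concat` returns null if any field is null, while `concat_ws` skips null fields.

```python
# Sub-fields of each date struct, in the order they should be joined.
DATE_PARTS = ["year", "month", "day", "hour", "minute", "second"]


def part_paths(struct_name, parts=DATE_PARTS):
    """Return dotted paths such as 'START_Time.year' for one struct column."""
    return [f"{struct_name}.{p}" for p in parts]


def join_date_structs(df, struct_names, sep="-"):
    """Replace each struct column in struct_names with one joined string column."""
    # Imported here so that part_paths stays usable without Spark installed.
    from pyspark.sql import functions as F

    for name in struct_names:
        joined = F.concat_ws(sep, *[F.col(p) for p in part_paths(name)])
        df = df.withColumn(name, joined)
    return df
```

Assuming `df` was read with `spark.read.json('data.json')` as above, it could be used as `join_date_structs(df, ["START_Time", "REGISTED", "SCHEDULED"]).show()`.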

You asked the same question; the answer is here
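
One detail from the question's schema is worth a note: there the SCHEDULED fields are `long`, not `string`, so a value such as 5 would be joined as "5" rather than "05". A minimal sketch of one way to zero-pad, assuming every field holds at most two digits (Spark's `lpad` shortens longer strings to the requested length, so a four-digit year would be cut). The helper name `padded_date_expr` is hypothetical; it only builds a Spark SQL expression string that can be passed to `pyspark.sql.functions.expr`.

```python
# Sub-fields of each date struct, in the order they should be joined.
DATE_PARTS = ["year", "month", "day", "hour", "minute", "second"]


def padded_date_expr(struct_name, sep="-"):
    """Build a Spark SQL expression that joins zero-padded struct fields.

    Each field is cast to string and left-padded with '0' to width 2.
    Assumes values have at most two digits; lpad truncates longer ones.
    """
    fields = ", ".join(
        f"lpad(cast({struct_name}.{p} AS string), 2, '0')" for p in DATE_PARTS
    )
    return f"concat_ws('{sep}', {fields})"
```

With a DataFrame `df` read from the JSON, this could be applied as `df.withColumn('SCHEDULED', expr(padded_date_expr('SCHEDULED')))` after `from pyspark.sql.functions import expr`.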