How to correctly explode fields in JSON using Spark SQL

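I am looking into using spark.sql() to extract data for better performance. However, I have an incredibly deeply nested JSON and am having trouble getting data out of it.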

Here is the schema of the JSON:

root
 |-- httpStatus: long (nullable = true)
 |-- httpStatusMessage: string (nullable = true)
 |-- response: struct (nullable = true)
 |    |-- body: struct (nullable = true)
 |    |    |-- dataProviders: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- dataProviderId: long (nullable = true)
 |    |    |    |    |-- drivers: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- driverFirstName: string (nullable = true)
 |    |    |    |    |    |    |-- driverId: long (nullable = true)
 |    |    |    |    |    |    |-- driverLastName: string (nullable = true)
 |    |    |    |    |    |    |-- driverRef: string (nullable = true)
 |    |    |    |    |    |    |-- totalDistance: double (nullable = true)
 |    |    |    |    |    |    |-- vehicles: array (nullable = true)
 |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |-- deviceId: long (nullable = true)
 |    |    |    |    |    |    |    |    |-- deviceRef: string (nullable = true)
 |    |    |    |    |    |    |    |    |-- trips: array (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |    |-- averageSpeed: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- tripDistanceTravelled: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- tripDuration: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- tripId: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- tripStart: struct (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |    |-- heading: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |    |-- latitude: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |    |-- longitude: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |    |-- mileage: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |    |-- speed: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |    |-- timestamp: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- tripStop: struct (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |    |-- heading: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |    |-- latitude: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |    |-- longitude: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |    |-- mileage: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |    |-- speed: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |    |-- timestamp: string (nullable = true)
 |    |    |    |    |    |    |    |    |-- vehicleId: long (nullable = true)
 |    |    |    |    |    |    |    |    |-- vehicleRef: string (nullable = true)
 |    |-- header: struct (nullable = true)
 |    |    |-- accelUnit: string (nullable = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- distanceUnit: string (nullable = true)
 |    |    |-- fleetId: long (nullable = true)
 |    |    |-- fleetName: string (nullable = true)
 |    |    |-- gpsUnit: string (nullable = true)
 |    |    |-- speedUnit: string (nullable = true)
 |-- timestamp: string (nullable = true)
I have been trying to explode these fields to reach the most deeply nested ones, but I run into problems when passing through an ArrayType.

Here is a sample of my code:

json_df = spark.read.json('/user/myuser/drivers_directory/driverRates.json')

json_df.printSchema()

json_df.show()
+----------+-----------------+--------------------+-------------------+
|httpStatus|httpStatusMessage|            response|          timestamp|
+----------+-----------------+--------------------+-------------------+
|       200|          success|[[[[14, [[Eric, 1...|2020-11-11T19:46:01|
+----------+-----------------+--------------------+-------------------+

# show() returns None, so keep the DataFrame and display it separately
body_df = json_df.select('response.*')
body_df.show()

json_df.select('response.*').select('body.*').show()
+--------------------+
|       dataProviders|
+--------------------+
|[[14, [[Eric, 100...|
+--------------------+


json_df.select('response.*').select('body.*').select('dataProviders.dataProviderId').show()
+--------------+
|dataProviderId|
+--------------+
|          [14]|
+--------------+
However, doing this for every field is very tedious, and it is bad for performance.
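
The chained select calls shown above can be collapsed by giving Spark the full dotted path in a single select. The following is a minimal sketch, not part of the original code: the constant PROVIDER_ID_PATH and the helper select_provider_ids are hypothetical names introduced for illustration. A dotted path that passes through an array (dataProviders here) still returns an array column such as [14], so this shortens the code but does not replace explode.

```python
# Hypothetical constant: the full dotted path from the root of the schema
# down to dataProviderId, replacing the three chained select() calls.
PROVIDER_ID_PATH = "response.body.dataProviders.dataProviderId"


def select_provider_ids(df):
    """Select dataProviderId in one step.

    `df` is assumed to be a DataFrame with the schema printed above,
    for example the json_df read from driverRates.json.
    The result is still an array column, because dataProviders is an array.
    """
    return df.select(PROVIDER_ID_PATH)
```

Calling select_provider_ids(json_df).show() should display the same single dataProviderId column as the chained version.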

I have been trying to use spark.sql() to get all of the information, but I keep hitting errors related to StructType and ArrayType.

I would like something like this:

json_df.createOrReplaceTempView('driver_dictionary')

final_driver_df = spark.sql("""select
            httpStatus as status
            , httpStatusMessage as message
            , timestamp as time
            from driver_dictionary
            lateral view explode(response) as r
            """)

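The problem I am running into is exploding body and the data beneath it. Depending on how I write the lateral view, I get either a StructType error or an ArrayType error. Any help is much appreciated.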
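
The errors described here are consistent with how explode works: it accepts only array or map columns. In the schema above, response is a struct, so its fields are reached with dot notation, while dataProviders is an array and is the first level that explode can take. The following is a minimal sketch of that distinction, assuming the driver_dictionary view registered earlier; FIXED_QUERY, run_fixed_query, and the aliases p and provider are hypothetical names.

```python
# Sketch only: struct fields use dot notation, array fields use explode.
FIXED_QUERY = """
select
    httpStatus as status
    , httpStatusMessage as message
    , timestamp as time
    , response.header.fleetId as fleetId          -- struct field: dot notation
    , provider.dataProviderId as dataProviderId   -- field of the exploded element
from driver_dictionary
lateral view outer explode(response.body.dataProviders) p as provider
"""


def run_fixed_query(spark):
    """Run the query on an existing SparkSession.

    Assumes json_df.createOrReplaceTempView('driver_dictionary')
    has already been called, as in the question.
    """
    return spark.sql(FIXED_QUERY)
```

The outer keyword keeps rows whose array is null or empty and fills the exploded columns with null, which is usually what you want when flattening optional nested data.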

This is what I was after:

drivers_exploded_df = spark.sql('''select
    httpStatus
    , httpStatusMessage
    , response.header.*
    , dataProviders.dataProviderId
    , drivers.driverId
    , drivers.driverRef
    , drivers.driverFirstName
    , drivers.driverLastName
    , timestamp
    from driver_dictionary
    lateral view outer explode (response.body.dataProviders) providers_tbl as dataProviders
    lateral view outer explode (dataProviders.drivers) dataProviders_drivers as drivers''')
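
The same pattern can be continued down to the vehicles and trips arrays, which are the most deeply nested arrays in the schema. The following is a sketch under assumptions rather than a tested solution: TRIPS_QUERY, build_trips_query, explode_trips, and every table and column alias are hypothetical names, and the view name defaults to the driver_dictionary view registered in the question.

```python
# Sketch only: one lateral view per array level in the printed schema.
# tripStart and tripStop are structs inside each trip, so they use dot notation.
TRIPS_QUERY = """
select
    httpStatus
    , provider.dataProviderId
    , driver.driverId
    , driver.driverFirstName
    , driver.driverLastName
    , vehicle.vehicleId
    , trip.tripId
    , trip.tripDistanceTravelled
    , trip.tripStart.timestamp as tripStartTime
    , trip.tripStop.timestamp as tripStopTime
from {view}
lateral view outer explode(response.body.dataProviders) providers_tbl as provider
lateral view outer explode(provider.drivers) drivers_tbl as driver
lateral view outer explode(driver.vehicles) vehicles_tbl as vehicle
lateral view outer explode(vehicle.trips) trips_tbl as trip
"""


def build_trips_query(view_name="driver_dictionary"):
    """Fill the view name into the query text."""
    return TRIPS_QUERY.format(view=view_name)


def explode_trips(spark, view_name="driver_dictionary"):
    """Return one row per trip, given an existing SparkSession."""
    return spark.sql(build_trips_query(view_name))
```

Each lateral view multiplies the rows by the length of the array it explodes, so the result has one row per trip, with the provider, driver, and vehicle columns repeated on every row.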