How to correctly explode fields in a nested JSON using Spark SQL
I'm looking into using spark.sql() to extract the data for better performance. But I have an incredibly nested JSON and I'm having a hard time pulling the data out of it. Here is the schema of the JSON:
root
|-- httpStatus: long (nullable = true)
|-- httpStatusMessage: string (nullable = true)
|-- response: struct (nullable = true)
| |-- body: struct (nullable = true)
| | |-- dataProviders: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- dataProviderId: long (nullable = true)
| | | | |-- drivers: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- driverFirstName: string (nullable = true)
| | | | | | |-- driverId: long (nullable = true)
| | | | | | |-- driverLastName: string (nullable = true)
| | | | | | |-- driverRef: string (nullable = true)
| | | | | | |-- totalDistance: double (nullable = true)
| | | | | | |-- vehicles: array (nullable = true)
| | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | |-- deviceId: long (nullable = true)
| | | | | | | | |-- deviceRef: string (nullable = true)
| | | | | | | | |-- trips: array (nullable = true)
| | | | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | | | |-- averageSpeed: double (nullable = true)
| | | | | | | | | | |-- tripDistanceTravelled: double (nullable = true)
| | | | | | | | | | |-- tripDuration: double (nullable = true)
| | | | | | | | | | |-- tripId: string (nullable = true)
| | | | | | | | | | |-- tripStart: struct (nullable = true)
| | | | | | | | | | | |-- heading: double (nullable = true)
| | | | | | | | | | | |-- latitude: double (nullable = true)
| | | | | | | | | | | |-- longitude: double (nullable = true)
| | | | | | | | | | | |-- mileage: double (nullable = true)
| | | | | | | | | | | |-- speed: double (nullable = true)
| | | | | | | | | | | |-- timestamp: string (nullable = true)
| | | | | | | | | | |-- tripStop: struct (nullable = true)
| | | | | | | | | | | |-- heading: double (nullable = true)
| | | | | | | | | | | |-- latitude: double (nullable = true)
| | | | | | | | | | | |-- longitude: double (nullable = true)
| | | | | | | | | | | |-- mileage: double (nullable = true)
| | | | | | | | | | | |-- speed: double (nullable = true)
| | | | | | | | | | | |-- timestamp: string (nullable = true)
| | | | | | | | |-- vehicleId: long (nullable = true)
| | | | | | | | |-- vehicleRef: string (nullable = true)
| |-- header: struct (nullable = true)
| | |-- accelUnit: string (nullable = true)
| | |-- date: string (nullable = true)
| | |-- distanceUnit: string (nullable = true)
| | |-- fleetId: long (nullable = true)
| | |-- fleetName: string (nullable = true)
| | |-- gpsUnit: string (nullable = true)
| | |-- speedUnit: string (nullable = true)
|-- timestamp: string (nullable = true)
I've been trying to explode these fields to get at the most deeply nested ones, but I run into problems whenever I hit an ArrayType.
Here is a sample of my code:
json_df = spark.read.json('/user/myuser/drivers_directory/driverRates.json')
json_df.printSchema()
json_df.show()
+----------+-----------------+--------------------+-------------------+
|httpStatus|httpStatusMessage| response| timestamp|
+----------+-----------------+--------------------+-------------------+
| 200| success|[[[[14, [[Eric, 1...|2020-11-11T19:46:01|
+----------+-----------------+--------------------+-------------------+
body_df = json_df.select('response.*')
body_df.show()
body_df.select('body.*').show()
+--------------------+
| dataProviders|
+--------------------+
|[[14, [[Eric, 100...|
+--------------------+
json_df.select('response.*').select('body.*').select('dataProviders.dataProviderId').show()
+--------------+
|dataProviderId|
+--------------+
| [14]|
+--------------+
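For context on the [14] above: selecting a field through an array column returns an array of that field's values, one per element. Exploding the array first gives back one scalar row per provider; a minimal DataFrame-API sketch of that same step:

from pyspark.sql import functions as F

# explode() turns the dataProviders array into one row per element,
# so the struct field can then be selected as a plain scalar column.
providers_df = (json_df
    .select(F.explode('response.body.dataProviders').alias('provider'))
    .select('provider.dataProviderId'))
providers_df.show()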
However, doing this for every field is very tedious, and it's terrible for performance. I've been trying to use spark.sql() to get everything at once, but I keep hitting problems with the StructType and ArrayType columns.
I want something like this:
json_df.createOrReplaceTempView('driver_dictionary')
final_driver_df = spark.sql("""select
  httpStatus as status
, httpStatusMessage as message
, timestamp as time
from driver_dictionary
lateral view explode(response) as r
""")
The problem I keep running into is trying to explode body and the data beneath it: using a lateral view I get a StructType error in one place and an ArrayType error in another. Any help is greatly appreciated.
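The underlying rule, which the error messages don't spell out: explode() only accepts ArrayType (or MapType) input, so structs such as response and response.body can never be exploded. Structs are traversed with dot notation, and only the arrays inside them need a lateral view. A minimal sketch against the same temp view:

# response and response.body are structs, so they are reached with
# dot notation; only dataProviders (an array) needs explode().
providers_df = spark.sql("""select provider.dataProviderId
from driver_dictionary
lateral view explode(response.body.dataProviders) providers_tbl as provider""")
providers_df.show()

With that in mind, here is what I wanted: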
drivers_exploded_df = spark.sql('''select
httpStatus
, httpStatusMessage
, response.header.*
, dataProviders.dataProviderId
, drivers.driverId
, drivers.driverRef
, drivers.driverFirstName
, drivers.driverLastName
, timestamp
from driver_dictionary
lateral view outer explode (response.body.dataProviders) providers_tbl as dataProviders
lateral view outer explode (dataProviders.drivers) dataProviders_drivers as drivers''')
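The same pattern extends two more levels down to reach the trips. A sketch (the column selection here is illustrative, not prescribed by the original question): each array gets its own lateral view, while struct fields such as tripStart and tripStop are reached with dot notation:

# One LATERAL VIEW per array level: dataProviders -> drivers ->
# vehicles -> trips. OUTER keeps rows whose arrays are null/empty.
trips_df = spark.sql('''select
      dataProviders.dataProviderId
    , drivers.driverId
    , vehicles.vehicleId
    , trips.tripId
    , trips.averageSpeed
    , trips.tripDistanceTravelled
    , trips.tripStart.timestamp as trip_start_time
    , trips.tripStop.timestamp as trip_stop_time
from driver_dictionary
lateral view outer explode(response.body.dataProviders) providers_tbl as dataProviders
lateral view outer explode(dataProviders.drivers) drivers_tbl as drivers
lateral view outer explode(drivers.vehicles) vehicles_tbl as vehicles
lateral view outer explode(vehicles.trips) trips_tbl as trips''')
trips_df.show()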