Python: creating a nested JSON file that handles null values
I am using PySpark, and I have the following code, which creates a nested JSON file from a DataFrame with the fields product, quantity, from, and to nested under requirements. Below is the code that creates the JSON, followed by an example of one output row:
final2 = final.groupby('identifier', 'plant', 'family', 'familyDescription', 'type', 'name', 'description', 'batchSize', 'phantom', 'makeOrBuy', 'safetyStock', 'unit', 'unitPrice', 'version').agg(F.collect_list(F.struct(F.col("product"), F.col("quantity"), F.col("from"), F.col("to"))).alias('requirements'))
{"identifier":"xxx","plant":"xxxx","family":"xxxx","familyDescription":"xxxx","type":"assembled","name":"xxxx","description":"xxxx","batchSize":20.0,"phantom":"False","makeOrBuy":"make","safetyStock":0.0,"unit":"PZ","unitPrice":xxxx,"version":"0001","requirements":[{"product":"yyyy","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"zzzz","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"kkkk","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"wwww","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"bbbb","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"}]}
The schema of the final2 DataFrame is as follows:
|-- identifier: string (nullable = true)
|-- plant: string (nullable = true)
|-- family: string (nullable = true)
|-- familyDescription: string (nullable = true)
|-- type: string (nullable = false)
|-- name: string (nullable = true)
|-- description: string (nullable = true)
|-- batchSize: double (nullable = true)
|-- phantom: string (nullable = false)
|-- makeOrBuy: string (nullable = false)
|-- safetyStock: double (nullable = true)
|-- unit: string (nullable = true)
|-- unitPrice: double (nullable = true)
|-- version: string (nullable = true)
|-- requirements: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- product: string (nullable = true)
|    |    |-- quantity: double (nullable = true)
|    |    |-- from: timestamp (nullable = true)
|    |    |-- to: timestamp (nullable = true)
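I am facing a problem because I have to add to my final DataFrame some rows where product, quantity, from, and to are all null. With the code above I get "requirements":[{}] for those rows, but the MongoDB database I write the file to raises an empty JSON object error, because it expects "requirements":[] for null values.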
I tried:
import pyspark.sql.functions as F
df = final_prova2.withColumn(
    "requirements",
    F.when(final_prova2.requirements.isNull(), F.array())
     .otherwise(final_prova2.requirements))
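but it does not work.
Any suggestions on how to modify the code? I am struggling to find a solution, and given the required structure I do not even know whether one is possible.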
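Thanks.
Hi all, can nobody help?
You need to check whether the four fields of requirements are all NULL, not whether the column itself is NULL. The requirements column is never NULL here: it is an array holding one struct whose fields are all NULL, which is why the isNull() test has no effect. One way to solve this is to adjust the collect_list aggregate function when creating final2, as in the code below.
In that code, a SQL expression IF(condition, true_value, false_value) sets the argument of collect_list. The condition, coalesce(quantity, product, from, to) IS NULL, tests whether the four listed columns are all NULL. If it is true, the expression returns NULL; otherwise it returns struct(product, quantity, from, to). Since collect_list skips NULL values, a group whose rows are all NULL ends up with an empty array.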
import pyspark.sql.functions as F
# collect_list skips NULL inputs, so rows whose four fields are all NULL add nothing to the array
final2 = final.groupby('identifier', 'plant', 'family', 'familyDescription', 'type', 'name', 'description', 'batchSize', 'phantom', 'makeOrBuy', 'safetyStock', 'unit', 'unitPrice', 'version') \
    .agg(F.expr("""
        collect_list(
          IF(coalesce(quantity, product, from, to) is NULL
            , NULL
            , struct(product, quantity, from, to)
          )
        )
    """).alias('requirements'))
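To make the logic of the aggregation easier to follow, here is a minimal plain-Python sketch of the same idea. It does not use Spark. The helper collect_requirements, the constant REQ_FIELDS, and the sample rows are hypothetical names and data made up for illustration, and the grouping uses only identifier instead of the full list of grouping columns.

```python
import json

# The four nested fields from the question.
REQ_FIELDS = ("product", "quantity", "from", "to")


def collect_requirements(rows, key="identifier"):
    """Group rows by `key` and collect one dict per row into 'requirements'.

    Hypothetical helper: it mimics collect_list(IF(coalesce(...) IS NULL, NULL, struct(...))).
    """
    grouped = {}
    for row in rows:
        # Create the group even if this row contributes nothing to it.
        reqs = grouped.setdefault(row[key], [])
        # Same test as coalesce(...) IS NULL: true only when every field is None.
        if all(row[field] is None for field in REQ_FIELDS):
            continue  # like collect_list, skip the NULL input
        reqs.append({field: row[field] for field in REQ_FIELDS})
    return [{key: k, "requirements": v} for k, v in grouped.items()]


# Made-up sample data: group "B" has only a row whose four fields are all None.
rows = [
    {"identifier": "A", "product": "yyyy", "quantity": 1.0,
     "from": "2000-01-01", "to": "9999-12-31"},
    {"identifier": "A", "product": "zzzz", "quantity": 1.0,
     "from": "2000-01-01", "to": "9999-12-31"},
    {"identifier": "B", "product": None, "quantity": None,
     "from": None, "to": None},
]

docs = collect_requirements(rows)
for doc in docs:
    print(json.dumps(doc))
```

Group "B" is still emitted, but its requirements list is empty rather than holding one empty object, which is the shape the question says MongoDB expects.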