Export a Spark DataFrame as a JSON array with custom metadata
I have some JSON documents stored in MongoDB. Each document looks like:
{"businessData":{"capacity":{"fuelCapacity":282},…}
After reading all the documents, I want to export them as a valid JSON file. Specifically:
// Read the JSON data from the database
val df: DataFrame = MongoSpark.load(sparkSession, readConfig)
df.show
// Export to the file system
df.coalesce(1).write.mode(SaveMode.Overwrite).json("export.json")
But when I export to the file system, I want to merge these 5 rows into a single array and add some custom metadata. For example:
{
"metadata" : { "exportTime": "20/20/2020" , ...},
"allBusinessData" : [
{"businessData":{"capacity":{"fuelCapacity":282}, ..},
// all 5 rows from above
]
}
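I have seen some questions and suggestions that only partially answer this, because they do not add a custom JSON structure to the export.
However, assuming this is the only way I can proceed, how would I do it?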
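Thanks a lot!
From Spark 2.2+: you can try using to_json (or) creating struct fields in Spark, then write the df in JSON format to get the desired output.
- For the sample data, I assumed exporttime to be current_timestamp().
Example: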
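//sample data as a JSON array; .toDS needs spark.implicits._ (imported automatically in spark-shell)
val df=spark.read.json(Seq("""[{"businessData":{"capacity":{"fuelCapacity":282}}},{"businessData":{"capacity":{"fuelCapacity":456}}}]""").toDS)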
//creating a struct field called metadata and write data in json format.
df.selectExpr("struct(current_timestamp() as exporttime,struct(collect_list(businessData) as businessData)as allBusinessData) as metadata").write.format("json").mode("overwrite").save("json_path")
//using .to_json to create json object in dataframe
df.selectExpr("to_json(struct(current_timestamp() as exporttime,struct(collect_list(businessData) as businessData)as allBusinessData))metadata").show(false)
//+-------------------------------------------------------------------------------------------------------------------------------------------------------+
//|metadata |
//+-------------------------------------------------------------------------------------------------------------------------------------------------------+
//|{"exporttime":"2020-03-21T15:17:54.769-05:00","allBusinessData":{"businessData":[{"capacity":{"fuelCapacity":282}},{"capacity":{"fuelCapacity":456}}]}}|
//+-------------------------------------------------------------------------------------------------------------------------------------------------------+
//using .toJSON to view json in shell(non-prod use only)
df.selectExpr("struct(current_timestamp() as exporttime,struct(collect_list(businessData) as businessData)as allBusinessData)metadata").toJSON.collect()
//Array[String] = Array({"metadata":{"exporttime":"2020-03-21T15:19:35.890-05:00","allBusinessData":{"businessData":[{"capacity":{"fuelCapacity":282}},{"capacity":{"fuelCapacity":456}}]}}})
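Hi, I have to read with the MongoSpark driver (MongoSpark.load), which has no .toDS method. I tried your solution anyway and it almost answers my question. I get "allBusinessData":{ instead of the desired "allBusinessData":[{… (note the opening square bracket [).
@user1485864, you don't need to worry about .toDS. I only used .toDS to read the sample data as JSON; your df is already in JSON format.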
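On the bracket issue raised in the comment: the extra object level comes from wrapping collect_list in its own struct. A minimal sketch of one way to get "allBusinessData" as a top-level array next to "metadata" is below. It is not the answer's original code. The names ExportWithMetadata and withMetadata, the output path export_json, and the dd/MM/yyyy date pattern are assumptions chosen for illustration, and the sample data stands in for the DataFrame returned by MongoSpark.load.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, current_timestamp, date_format, struct}

object ExportWithMetadata {
  // Hypothetical helper: collapses all rows of `df` into one row with two columns,
  // a `metadata` struct and an `allBusinessData` array of {"businessData": ...} objects.
  def withMetadata(df: DataFrame): DataFrame =
    df.agg(collect_list(struct(col("businessData"))).as("allBusinessData"))
      .select(
        // the date pattern is an assumption; use whatever format the export needs
        struct(date_format(current_timestamp(), "dd/MM/yyyy").as("exportTime")).as("metadata"),
        col("allBusinessData")
      )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("export-sketch").getOrCreate()
    import spark.implicits._

    // Sample data in place of MongoSpark.load(sparkSession, readConfig)
    val df = spark.read.json(Seq(
      """[{"businessData":{"capacity":{"fuelCapacity":282}}},{"businessData":{"capacity":{"fuelCapacity":456}}}]"""
    ).toDS)

    // Spark writes a directory named export_json containing a part-*.json file
    withMetadata(df).coalesce(1).write.mode(SaveMode.Overwrite).json("export_json")
    spark.stop()
  }
}
```

Two caveats: collect_list pulls every row into a single value, so this only suits exports that fit comfortably in memory, and the order of the elements in the array is not guaranteed.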