Export a Spark DataFrame as a JSON array with custom metadata

I have some JSON documents stored in MongoDB. Each document looks like:
{"businessData":{"capacity":{"fuelCapacity":282},…}}

After reading all the documents, I want to export them as a valid JSON file. Specifically:

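import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.{DataFrame, SaveMode}

// Read the JSON data from the database
val df: DataFrame = MongoSpark.load(sparkSession, readConfig)
df.show
// Export to the file system
df.coalesce(1).write.mode(SaveMode.Overwrite).json("export.json")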
But when I export to the file system, I want to merge the 5 rows into a single array and add some custom metadata. For example:

{
  "metadata" : { "exportTime": "20/20/2020" , ...},
  "allBusinessData" : [
    {"businessData":{"capacity":{"fuelCapacity":282}, ..},
    // all 5 rows from above
  ]
}
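I have seen some related questions and suggestions, but they only partly answer this, because they do not add a custom JSON structure to the export.

However, assuming this is the only way I can proceed, how would I do it?

Thanks a lot!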

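From Spark 2.2+:

You can try using to_json, or creating a struct field, in Spark, and then write the df in JSON format to get the desired output.

  • For the sample data, I assume the export time is current_timestamp().
Example: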

// toDS needs spark.implicits._ (already imported in spark-shell)
import spark.implicits._
val df = spark.read.json(Seq("""[{"businessData":{"capacity":{"fuelCapacity":282}}},{"businessData":{"capacity":{"fuelCapacity":456}}}]""").toDS)

//creating a struct field called metadata and write data in json format.
df.selectExpr("struct(current_timestamp() as exporttime,struct(collect_list(businessData) as businessData)as allBusinessData) as metadata").write.format("json").mode("overwrite").save("json_path")

//using .to_json to create json object in dataframe
df.selectExpr("to_json(struct(current_timestamp() as exporttime,struct(collect_list(businessData) as businessData)as allBusinessData))metadata").show(false)

//+-------------------------------------------------------------------------------------------------------------------------------------------------------+
//|metadata                                                                                                                                               |
//+-------------------------------------------------------------------------------------------------------------------------------------------------------+
//|{"exporttime":"2020-03-21T15:17:54.769-05:00","allBusinessData":{"businessData":[{"capacity":{"fuelCapacity":282}},{"capacity":{"fuelCapacity":456}}]}}|
//+-------------------------------------------------------------------------------------------------------------------------------------------------------+

//using .toJSON to view the json in the shell (non-prod use only)
df.selectExpr("struct(current_timestamp() as exporttime,struct(collect_list(businessData) as businessData)as allBusinessData)metadata").toJSON.collect()

//Array[String] = Array({"metadata":{"exporttime":"2020-03-21T15:19:35.890-05:00","allBusinessData":{"businessData":[{"capacity":{"fuelCapacity":282}},{"capacity":{"fuelCapacity":456}}]}}})
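For a small result set such as the 5 rows in the question, one alternative is to collect the rows as JSON strings with .toJSON and assemble the wrapper object on the driver. The following is only a sketch under assumptions: the object name WrapRowsOnDriver, the helper wrap, the local SparkSession, and the output path export.json are all hypothetical, and the sample data stands in for the DataFrame returned by MongoSpark.load. It pulls every row onto the driver, so it is not suitable for large collections.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import java.time.Instant

import org.apache.spark.sql.SparkSession

object WrapRowsOnDriver {

  // Hypothetical helper: wraps already-serialized JSON rows in the
  // metadata / allBusinessData envelope described in the question.
  // exportTime is assumed to contain no characters that need JSON escaping.
  def wrap(rows: Seq[String], exportTime: String): String =
    s"""{"metadata":{"exportTime":"$exportTime"},"allBusinessData":[${rows.mkString(",")}]}"""

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("wrap-rows-on-driver")
      .getOrCreate()
    import spark.implicits._

    // Sample data standing in for MongoSpark.load(sparkSession, readConfig).
    val df = spark.read.json(Seq(
      """[{"businessData":{"capacity":{"fuelCapacity":282}}},{"businessData":{"capacity":{"fuelCapacity":456}}}]"""
    ).toDS)

    // Each element is one row serialized as a JSON object.
    val rows: Seq[String] = df.toJSON.collect().toSeq

    // Write a single file from the driver instead of using df.write.
    val json = wrap(rows, Instant.now().toString)
    Files.write(Paths.get("export.json"), json.getBytes(StandardCharsets.UTF_8))

    spark.stop()
  }
}
```

Because the file is written by the driver, the result is one plain file rather than a directory of part files, which is what df.write.json produces.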

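Hi, I have to read using the MongoSpark driver (MongoSpark.load), which has no .toDS method. I still tried your solution, and it almost answers my question. I get "allBusinessData":{ instead of the desired "allBusinessData":[{… (note the opening square bracket [).

@user1485864, you don't need to worry about .toDS … it is only used for the sample data, which I read as .json using .toDS; your df is already in JSON format.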
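The extra nesting reported in the comment comes from wrapping collect_list inside another struct. A possible way to get "allBusinessData" as an array directly is to make collect_list the top-level expression and keep the metadata in its own struct. This is a sketch under assumptions, not a tested answer for the MongoDB setup: the object name, the local SparkSession, and the output path export_json are hypothetical, exportTime is assumed to be current_timestamp() as in the answer, and the sample data stands in for the DataFrame from MongoSpark.load.

```scala
import org.apache.spark.sql.SparkSession

object ExportWithMetadata {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("export-with-metadata")
      .getOrCreate()
    import spark.implicits._

    // Sample data standing in for MongoSpark.load(sparkSession, readConfig).
    val df = spark.read.json(Seq(
      """[{"businessData":{"capacity":{"fuelCapacity":282}}},{"businessData":{"capacity":{"fuelCapacity":456}}}]"""
    ).toDS)

    // Two top-level columns:
    //   metadata        -> a struct holding the export time
    //   allBusinessData -> an array with one {"businessData": ...} struct per input row
    // collect_list without a GROUP BY aggregates the whole DataFrame into one row.
    val exported = df.selectExpr(
      "struct(current_timestamp() as exportTime) as metadata",
      "collect_list(struct(businessData)) as allBusinessData"
    )

    // One output row, so a single part file is written under export_json/.
    exported.coalesce(1).write.mode("overwrite").json("export_json")

    spark.stop()
  }
}
```

With this shape, the written record should have "metadata" as an object and "allBusinessData" as an array of {"businessData": …} objects. Note that collect_list gathers all rows into a single value, so it only fits data that is small enough for one record, and the order of elements in the array is not guaranteed.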