
Java - Preserve keys with null values when writing JSON in Spark

Tags: java, json, apache-spark, apache-spark-sql

I am trying to write a JSON file using Spark. Some of the keys have null values. They show up fine in the Dataset, but when I write the file, the keys are dropped. How can I ensure they are preserved?

Code to write the file:

ddp.coalesce(20).write().mode("overwrite").json("hdfs://localhost:9000/user/dedupe_employee");
Partial JSON data from the source:

"event_header": {
        "accept_language": null,
        "app_id": "App_ID",
        "app_name": null,
        "client_ip_address": "IP",
        "event_id": "ID",
        "event_timestamp": null,
        "offering_id": "Offering",
        "server_ip_address": "IP",
        "server_timestamp": 1492565987565,
        "topic_name": "Topic",
        "version": "1.0"
    }
Output:

"event_header": {
        "app_id": "App_ID",
        "client_ip_address": "IP",
        "event_id": "ID",
        "offering_id": "Offering",
        "server_ip_address": "IP",
        "server_timestamp": 1492565987565,
        "topic_name": "Topic",
        "version": "1.0"
    }

In the above example, the keys accept_language, app_name, and event_timestamp have been dropped.

Apparently Spark does not provide any option to handle null values when writing JSON, so the custom solution below should work.

import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.databind.ObjectMapper

case class EventHeader(accept_language: String, app_id: String, app_name: String, client_ip_address: String, event_id: String, event_timestamp: String, offering_id: String, server_ip_address: String, server_timestamp: Long, topic_name: String, version: String)

val ds = Seq(EventHeader(null, "App_ID", null, "IP", "ID", null, "Offering", "IP", 1492565987565L, "Topic", "1.0")).toDS()

// Serialize each record with Jackson, which keeps null-valued fields by default.
// One ObjectMapper is built per partition rather than per record.
val ds1 = ds.mapPartitions { records =>
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  records.map(mapper.writeValueAsString(_))
}

// Write the pre-serialized JSON strings as plain text, bypassing Spark's JSON writer.
ds1.coalesce(1).write.text("hdfs://localhost:9000/user/dedupe_employee")
This produces output like:

{"accept_language":null,"app_id":"App_ID","app_name":null,"client_ip_address":"IP","event_id":"ID","event_timestamp":null,"offering_id":"Offering","server_ip_address":"IP","server_timestamp":1492565987565,"topic_name":"Topic","version":"1.0"}

If you are on Spark 3, you can add

spark.sql.jsonGenerator.ignoreNullFields false
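
A minimal sketch of applying that setting, assuming the SparkSession is named spark and reusing the ddp Dataset and HDFS path from the question. Spark 3 also exposes ignoreNullFields as a per-write option on the JSON data source, which is the scoped equivalent of the session-level config:

// Session-wide: preserve null-valued keys in all JSON output
spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", "false")

// Per-write: the equivalent JSON data source option
ddp.coalesce(20)
  .write
  .mode("overwrite")
  .option("ignoreNullFields", "false")
  .json("hdfs://localhost:9000/user/dedupe_employee")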

Could you provide such a solution in PySpark? Thanks.