How to write valid JSON in Spark


I need to write valid JSON, but Spark only lets me write it one row at a time, like this:

{"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
{"name":"Michael", "address":{"city":null, "state":"California"}}
The JSON above is invalid. Instead, I need this:

{
{"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}},
{"name":"Michael", "address":{"city":null, "state":"California"}}
}

How can I achieve this in Java?
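
For context, here is a minimal, hypothetical Java setup for the DataFrame used in the answer below; the SparkSession and the input path "people.json" are assumptions, not from the original post:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical setup; adjust the app name and input path to your environment.
SparkSession spark = SparkSession.builder().appName("valid-json").getOrCreate();

// Spark's JSON reader expects line-delimited input: one object per line,
// exactly like the sample records shown above.
Dataset<Row> df = spark.read().json("people.json");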

Start by converting the DataFrame rows to JSON:

Scala

val jsonDs = df.toJSON
Java

Dataset<String> jsonDs = simpleProf.toJSON();
The next steps depend on whether you want to save everything to one file or to multiple files, one per partition.

Save to a single JSON file

Scala

val count = jsonDs.count()
jsonDs
  .repartition(1) // make sure it is only one partition and in consequence one output file
  .rdd
  .zipWithIndex()
  .map { case(json, idx) =>
      if(idx == 0) "[\n" + json + "," // first row
      else if(idx == count-1) json + "\n]" // last row
      else json + ","
  }
  .saveAsTextFile("path")
Java

long count = jsonDs.count(); // total rows, referenced in the lambda below
jsonDs
  .repartition(1) // make sure it is only one partition and in consequence one output file
  .javaRDD()
  .zipWithIndex()
  .map(t -> t._2 == 0 ? "[\n" + t._1 + "," : t._2 == count-1 ? t._1 + "\n]" : t._1 + ",")
  .saveAsTextFile("path");
Save to multiple JSON files, one per partition

Scala

jsonDs
  .mapPartitions(vals => Iterator("[" + vals.mkString(",") + "]"))
  .write
  .text("path")
Java

import java.util.Arrays;
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.sql.Encoders;

jsonDs
  .mapPartitions(input -> Arrays.asList("[" + StringUtils.join(input, ",") + "]").iterator(), Encoders.STRING())
  .write()
  .text("path");

It's hard to tell what you're asking without knowing what you already have, but does this help: ?

I need to convert a Hive table to XML, but I ran into all sorts of problems doing that directly. So first I convert the Hive table to JSON, and then I will convert the JSON to XML. But when I converted the Hive table to JSON, I found that the JSON was invalid. So I just need to turn it into a valid file.

The multiple-JSON-files solution should work for a single file if you repartition it to 1 (preferably using coalesce(1) instead of repartition(1)). Note that when you change to one partition, everything has to fit on that executor. Also, you are missing the { and } around the objects.

Please provide the Java code for "save to a single JSON file", since I'm not familiar with lambda expressions in Java.

Hi, I added the Java implementation.
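
Building on the coalesce(1) suggestion from the comments, here is an untested sketch that combines it with the mapPartitions approach from the answer, producing a single valid JSON file without the zipWithIndex bookkeeping (jsonDs is the Dataset&lt;String&gt; from the answer):

import java.util.Collections;
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.sql.Encoders;

jsonDs
  .coalesce(1) // merge to one partition without a full shuffle; all rows must fit on one executor
  .mapPartitions(
      input -> Collections.singletonList("[" + StringUtils.join(input, ",") + "]").iterator(),
      Encoders.STRING())
  .write()
  .text("path");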