Apache Spark: convert a nested JSON string inside a Dataset into a Dataset/DataFrame in Spark Scala

I have a simple program with a Dataset whose column resource_serialized holds a JSON string as its value, like this:

import org.apache.spark.SparkConf

object TestApp {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("Loading Data").setMaster("local[*]")

    val spark = org.apache.spark.sql.SparkSession
      .builder
      .config(sparkConf)
      .appName("Test")
      .getOrCreate()

    val json = "[{\"resource_serialized\":\"{\\\"createdOn\\\":\\\"2000-07-20 00:00:00.0\\\",\\\"genderCode\\\":\\\"0\\\"}\",\"id\":\"00529e54-0f3d-4c76-9d3\"}]"

    import spark.implicits._
    val df = spark.read.json(Seq(json).toDS)
    df.printSchema()
    df.show()
  }
}
The printed schema is:

root
 |-- id: string (nullable = true)
 |-- resource_serialized: string (nullable = true)
The Dataset printed to the console is:

+--------------------+--------------------+
|                  id| resource_serialized|
+--------------------+--------------------+
|00529e54-0f3d-4c7...|{"createdOn":"200...|
+--------------------+--------------------+
The resource_serialized field holds the nested JSON string (shown here from the debug console).

Now I need to build a Dataset/DataFrame from that nested JSON string. How can I achieve this?

My goal is to get a Dataset like this:

+--------------------+--------------------+----------+
|                  id|           createdOn|genderCode|
+--------------------+--------------------+----------+
|00529e54-0f3d-4c7...|2000-07-20 00:00    |         0|
+--------------------+--------------------+----------+
Use the from_json function to convert the JSON string into DataFrame columns.

For example:

If you have valid JSON, you can also read it directly with spark.read.json and a schema.
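
A minimal sketch of the from_json approach, assuming the nested JSON has exactly the two fields shown in the question (createdOn and genderCode) and keeping both as strings; it builds on the df from the question's code:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical schema for the nested JSON held in resource_serialized
val resourceSchema = StructType(Seq(
  StructField("createdOn", StringType, nullable = true),
  StructField("genderCode", StringType, nullable = true)
))

// Parse the string column into a struct, then flatten its fields into top-level columns
val parsed = df
  .withColumn("resource", from_json($"resource_serialized", resourceSchema))
  .select($"id", $"resource.createdOn", $"resource.genderCode")

parsed.show(false)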


The following solution parses resource_serialized into a map of string keys and string values, and then expands the map's keys into columns.

import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}

object TestApp {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("Loading Data").setMaster("local[*]")

    val spark = org.apache.spark.sql.SparkSession
      .builder
      .config(sparkConf)
      .appName("Test")
      .getOrCreate()

    val json = "[{\"resource_serialized\":\"{\\\"createdOn\\\":\\\"2000-07-20 00:00:00.0\\\",\\\"genderCode\\\":\\\"0\\\"}\",\"id\":\"00529e54-0f3d-4c76-9d3\"}]"

    import spark.implicits._
    val df = spark.read.json(Seq(json).toDS)
    // Parse the serialized JSON column into a map of string keys to string values
    val jsonColumn = from_json($"resource_serialized", MapType(StringType, StringType))
    // Collect the distinct keys that appear in the nested JSON
    val keysDF = df.select(explode(map_keys(jsonColumn))).distinct()
    val keys = keysDF.collect().map(f => f.get(0))
    // Turn each key into its own column, named after the key
    val keyCols = keys.map(f => jsonColumn.getItem(f).as(f.toString))
    df.select($"id" +: keyCols: _*).show(false)

  }
}


The output looks like:

+----------------------+---------------------+----------+
|id                    |createdOn            |genderCode|
+----------------------+---------------------+----------+
|00529e54-0f3d-4c76-9d3|2000-07-20 00:00:00.0|0         |
+----------------------+---------------------+----------+
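
Note that MapType(StringType, StringType) keeps every extracted value as a string, so any further typing (for example, parsing createdOn into a timestamp) would still need to be done afterwards.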

This is so cool, thanks! Is there a way to map it if I don't have a StructType? Can we map it into column form without defining a StructType, as Shu mentioned in his answer?

Yes, check it out: just add my lines to your code and it will work. I have now added the complete code.

I have updated my question with the expected table, would you mind checking it?

Updated the solution. @YogenRai if it answers your question, please upvote and accept it.
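
Following up on the comment about avoiding a hand-written StructType: a minimal sketch, assuming Spark 2.4+ (where schema_of_json is available) and that every row of resource_serialized shares the same structure, which infers the nested schema from one sample value and reuses it in from_json:

import org.apache.spark.sql.functions.{from_json, lit, schema_of_json}

// Take one sample value of resource_serialized and infer the nested schema from it
val sampleJson = df.select($"resource_serialized").as[String].head()

// Parse every row with the inferred schema and flatten the struct into top-level columns
val inferred = df
  .withColumn("resource", from_json($"resource_serialized", schema_of_json(lit(sampleJson))))
  .select($"id", $"resource.*")

inferred.show(false)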