
Java: How do I split JSON into Dataset rows?


I have the following JSON input data:

{
    "lib": [
      {
        "id": "a1",
        "type": "push",
        "icons": [
          {
            "iId": "111"
          }
        ],
        "id": "a2",
        "type": "pull",
        "icons": [
          {
            "iId": "111"
          },
          {
            "iId": "222"
          }
        ]
      }
]
I want to obtain the following dataset:

id   type     iId
a1   push     111
a2   pull     111
a2   pull     222
How can I do this?

Here is my current code. I am using Spark 2.3 with Java 1.8:

ds = spark
         .read()
         .option("multiLine", true).option("mode", "PERMISSIVE")
         .json(jsonFilePath);

ds = ds
        .select(org.apache.spark.sql.functions.explode(ds.col("lib.icons")).as("icons"));
But the result is wrong:

+---------------+
|          icons|
+---------------+
|        [[111]]|
|[[111], [222...|
+---------------+
How can I get the correct dataset?

UPDATE:

I tried the following code, but it generates extra combinations of id, type, and iId that do not exist in the input file:

ds = ds
      .withColumn("icons", org.apache.spark.sql.functions.explode(ds.col("lib.icons")))
      .withColumn("id", org.apache.spark.sql.functions.explode(ds.col("lib.id")))
      .withColumn("type", org.apache.spark.sql.functions.explode(ds.col("lib.type")));

ds = ds.withColumn("its",  org.apache.spark.sql.functions.explode(ds.col("icons")));
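The extra combinations appear because each `explode` runs independently: exploding `lib.id`, `lib.type`, and `lib.icons` in separate `withColumn` calls pairs every element of one array with every element of the others, producing a Cartesian product instead of keeping matching indices together. A plain-Java sketch of the effect (hypothetical helper, no Spark involved):

```java
import java.util.ArrayList;
import java.util.List;

public class CrossProductDemo {
    // Two independent explodes behave like nested loops:
    // every id is paired with every type.
    public static List<String> cross(List<String> ids, List<String> types) {
        List<String> combos = new ArrayList<>();
        for (String id : ids) {
            for (String type : types) {
                combos.add(id + "/" + type);
            }
        }
        return combos;
    }

    public static void main(String[] args) {
        // lib.id and lib.type as Spark collects them across the array
        System.out.println(cross(List.of("a1", "a2"), List.of("push", "pull")));
        // 4 combinations, although only 2 (a1/push, a2/pull) exist in the data
    }
}
```

This is exactly the symptom described above: pairings such as `a1/pull` appear even though no element of `lib` contains them.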

Your JSON appears to be malformed. Fixing the indentation makes this more obvious:

{
  "lib": [
    {
      "id": "a1",
      "type": "push",
      "icons": [
        {
          "iId": "111"
        }
      ],
      "id": "a2",
      "type": "pull",
      "icons": [
        {
          "iId": "111"
        },
        {
          "iId": "222"
        }
      ]
    }
  ]
Does your code work correctly if you use this JSON instead?

{
  "lib": [
    {
      "id": "a1",
      "type": "push",
      "icons": [
        {
          "iId": "111"
        }
      ]
    },
    {
      "id": "a2",
      "type": "pull",
      "icons": [
        {
          "iId": "111"
        },
        {
          "iId": "222"
        }
      ]
    }
  ]
}

Note the `},{` inserted just before `"id": "a2"`, which splits the object with duplicated keys in two, and the closing `}` added at the end, which was previously missing.
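Duplicate keys matter because most JSON parsers model an object as a map, so a repeated key silently overwrites the earlier value instead of producing two records. A minimal illustration of that last-write-wins map semantics in plain Java (no JSON library involved, just a map standing in for a parsed object):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DuplicateKeyDemo {
    // Simulates how a typical parser stores the fields of the
    // malformed object: each duplicate key replaces the previous value.
    public static Map<String, String> parseLikeAMap() {
        Map<String, String> obj = new LinkedHashMap<>();
        obj.put("id", "a1");      // first "id" in the malformed object
        obj.put("type", "push");
        obj.put("id", "a2");      // duplicate key: "a1" is lost
        obj.put("type", "pull");  // duplicate key: "push" is lost
        return obj;
    }

    public static void main(String[] args) {
        System.out.println(parseLikeAMap()); // {id=a2, type=pull}
    }
}
```

So with the original file, the `a1`/`push` entry can never reach the Dataset at all; splitting the object in two is the only way to keep both records.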

As noted above, your JSON string appears to be malformed. With the corrected version, you can get the desired result as follows:

import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." column syntax (already in scope in spark-shell)

spark.read
  .format("json")
  .load("in/test.json")
  .select(explode($"lib").alias("result"))
  .select($"result.id", $"result.type", explode($"result.icons").alias("iId"))
  .select($"id", $"type", $"iId.iId")
  .show
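Since the question targets Java 1.8 rather than Scala, the same pipeline can be sketched in Java as well. This is an untested translation of the answer above (it assumes the corrected JSON at `in/test.json` and Spark 2.3 on the classpath; `ExplodeJson` and `explodeLib` are hypothetical names):

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ExplodeJson {
    // Reads the JSON file and flattens lib -> one row per (id, type, iId).
    public static Dataset<Row> explodeLib(SparkSession spark, String path) {
        return spark.read()
                .option("multiLine", true)    // the file spans multiple lines
                .option("mode", "PERMISSIVE")
                .json(path)
                .select(explode(col("lib")).alias("result"))
                .select(col("result.id"), col("result.type"),
                        explode(col("result.icons")).alias("icon"))
                .select(col("id"), col("type"), col("icon.iId"));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("explode-json")
                .master("local[*]")
                .getOrCreate();
        explodeLib(spark, "in/test.json").show();
        spark.stop();
    }
}
```

The first `explode` turns each element of `lib` into its own row, and the second turns each entry of that row's `icons` array into its own row, so `id` and `type` stay attached to the icons they belong to.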

I applied your code to the JSON file provided in @MTCoster's answer and got the following error: `org.apache.spark.sql.AnalysisException: cannot resolve column name result.id` in `lib`. Also, it only reads this JSON when I add `.option("multiLine", true).option("mode", "PERMISSIVE")` to `spark.read`; otherwise it reports the records as corrupt.