Java: Creating a JSON schema with StructType in Apache Spark


I am trying to create a StructType schema for the JSON below:

{ 
   "countries":{ 
      "country":[ 
         { 
            "area":9596960,
            "cities":{ 

            },
            "name":"China",
            "population":1210004992
         },
         { 
            "area":3287590,
            "cities":{ 

            },
            "name":"India",
            "population":952107712
         },
         { 
            "area":9372610,
            "cities":{ 
               "city":[ 
                  { 
                     "name":"New York",
                     "population":7380906
                  },
                  { 
                     "name":"Los Angeles",
                     "population":3553638
                  },
                  { 
                     "name":"Chicago",
                     "population":2721547
                  },
                  { 
                     "name":"Detroit",
                     "population":1000272
                  }
               ]
            },
            "name":"United States",
            "population":266476272
         },
         { 
            "area":1919440,
            "cities":{ 
               "city":[ 
                  { 
                     "name":"Jakarta",
                     "population":8259266
                  },
                  { 
                     "name":"Surabaya",
                     "population":2483871
                  },
                  { 
                     "name":"Bandung",
                     "population":2058649
                  },
                  { 
                     "name":"Medan",
                     "population":1730752
                  },
                  { 
                     "name":"Semarang",
                     "population":1250971
                  },
                  { 
                     "name":"Palembang",
                     "population":1144279
                  }
               ]
            },
            "name":"Indonesia",
            "population":206611600
         }
      ]
   }
}
I am using the code below to get the names of all the countries:

DataTypes.createStructField("countries", (new StructType()).add(DataTypes.createStructField("country",
                    (new StructType()).add(DataTypes.createStructField("name", DataTypes.StringType, true)), true)), true)
But when I run the following to get all the country names:

Dataset<Row> namesDF = spark.sql("SELECT countries FROM country");
        namesDF.show();
Entry point:

Dataset<Row> ds = spark.read().schema(Jsonreadystructure.SCHEMA)
                .json(context.getProperty(GlobalConstants.ReadyJsonFile));

        ds.printSchema();

        ds.createOrReplaceTempView("country_data");
        ds.sqlContext().sql("SELECT country.name FROM country_data lateral view explode(countries.country) t as country").show(false);
Why does it show null names? I am using Spark 2.4.4.

Schema from schema discovery:

root
 |-- countries: struct (nullable = true)
 |    |-- country: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- area: double (nullable = true)
 |    |    |    |-- cities: struct (nullable = true)
 |    |    |    |    |-- city: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- population: long (nullable = true)

In the JSON, the country field contains an array, not a struct, so the schema does not match the data. You should build the schema with ArrayType instead, as shown below:

DataTypes.createStructField("countries",
    new StructType().add(DataTypes.createStructField("country",
        new ArrayType(new StructType().add(DataTypes.createStructField("name", DataTypes.StringType, true)), true),
        true)),
    true)
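For reference, here is a sketch of how the full schema might be written in Java with DataTypes.createArrayType, covering the remaining fields that appear in the sample JSON (area, cities, population); the field list and types are taken from the JSON and the discovered schema above, and this is only one way to assemble it:

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Schema for one city entry: { "name": ..., "population": ... }
StructType citySchema = new StructType()
        .add("name", DataTypes.StringType, true)
        .add("population", DataTypes.LongType, true);

// "cities" is a struct whose "city" field is an array of city structs
StructType citiesSchema = new StructType()
        .add("city", DataTypes.createArrayType(citySchema, true), true);

// Schema for one country entry
StructType countrySchema = new StructType()
        .add("area", DataTypes.DoubleType, true)
        .add("cities", citiesSchema, true)
        .add("name", DataTypes.StringType, true)
        .add("population", DataTypes.LongType, true);

// Top level: "countries" is a struct containing the "country" array
StructType schema = new StructType()
        .add("countries", new StructType()
                .add("country", DataTypes.createArrayType(countrySchema, true), true), true);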
With this schema, you will get the following countries:

df.registerTempTable("country_data");
spark.sql("SELECT countries FROM country_data").show();
+--------------------------------------------------------------+
|countries                                                     |
+--------------------------------------------------------------+
|[WrappedArray([China], [India], [United States], [Indonesia])]|
+--------------------------------------------------------------+
If you want to list all the countries in the array, you should use explode:
spark.sql("SELECT country.name FROM country_data lateral view explode(countries.country) t as country").show(false)
+-------------+
|name         |
+-------------+
|China        |
|India        |
|United States|
|Indonesia    |
+-------------+
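The same result can also be obtained with the Java Dataset API instead of SQL; a sketch, assuming ds was read with the corrected schema as in the question's entry point:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

// Explode the countries.country array into one row per country, then select the name
ds.select(explode(col("countries.country")).as("country"))
  .select(col("country.name"))
  .show(false);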

It doesn't show any entries in the output; please check the update in the question.
What is GlobalConstants.ReadyJsonFile? Does your file contain pretty-printed JSON (as in your post) or single-line JSON?
The problem was that my JSON was not single-line JSON; that is why it failed. May I know why that is? Could you check this question? I have almost finished the extraction but got stuck on a nested element.
@JonLe your code looks complicated. Why not use schema discovery? I mean
val schema = spark.read.json(Seq(json).toDS.rdd).schema.json
and then
org.apache.spark.sql.types.DataType.fromJson(schema)
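Regarding the single-line JSON issue from the comments: by default Spark's JSON reader expects one JSON object per line (JSON Lines), so a pretty-printed file like the one in the question needs the multiLine option. A sketch, reusing the path expression from the question, that also shows schema discovery in Java:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;

// multiLine is required for pretty-printed JSON; without it Spark expects one JSON object per line
Dataset<Row> ds = spark.read()
        .option("multiLine", true)
        .json(context.getProperty(GlobalConstants.ReadyJsonFile));

// Schema discovery: let Spark infer the schema, then reuse it
StructType inferred = ds.schema();
System.out.println(inferred.json());   // JSON form that DataType.fromJson can parse back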