Java: Creating a JSON schema with StructType in Apache Spark
I am trying to create a StructType schema for the JSON below:
{
"countries":{
"country":[
{
"area":9596960,
"cities":{
},
"name":"China",
"population":1210004992
},
{
"area":3287590,
"cities":{
},
"name":"India",
"population":952107712
},
{
"area":9372610,
"cities":{
"city":[
{
"name":"New York",
"population":7380906
},
{
"name":"Los Angeles",
"population":3553638
},
{
"name":"Chicago",
"population":2721547
},
{
"name":"Detroit",
"population":1000272
}
]
},
"name":"United States",
"population":266476272
},
{
"area":1919440,
"cities":{
"city":[
{
"name":"Jakarta",
"population":8259266
},
{
"name":"Surabaya",
"population":2483871
},
{
"name":"Bandung",
"population":2058649
},
{
"name":"Medan",
"population":1730752
},
{
"name":"Semarang",
"population":1250971
},
{
"name":"Palembang",
"population":1144279
}
]
},
"name":"Indonesia",
"population":206611600
}
]
}
}
I am using the schema below to get the names of all countries:
DataTypes.createStructField("countries", (new StructType()).add(DataTypes.createStructField("country",
(new StructType()).add(DataTypes.createStructField("name", DataTypes.StringType, true)), true)), true)
But when I run the query below to get the names of all countries:
Dataset<Row> namesDF = spark.sql("SELECT countries FROM country");
namesDF.show();
Entry point:
Dataset<Row> ds = spark.read().schema(Jsonreadystructure.SCHEMA)
.json(context.getProperty(GlobalConstants.ReadyJsonFile));
ds.printSchema();
ds.createOrReplaceTempView("country_data");
ds.sqlContext().sql("SELECT country.name FROM country_data lateral view explode(countries.country) t as country").show(false);
Why does it show empty names? I am using Spark 2.4.4.
Discovered schema:
root
|-- countries: struct (nullable = true)
| |-- country: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- area: double (nullable = true)
| | | |-- cities: struct (nullable = true)
| | | | |-- city: string (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- population: long (nullable = true)
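In the JSON, the country field contains an array, not a struct, so a schema that declares it as a struct does not match the data. You should declare it with ArrayType, as follows: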
DataTypes.createStructField("countries",
new StructType().add(DataTypes.createStructField("country",
new ArrayType(new StructType().add(DataTypes.createStructField("name",
DataTypes.StringType, true)), true), true)), true)
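The same schema can also be assembled step by step with the DataTypes factory methods, which some find easier to read than nested constructors. The following is only a sketch: the class name CountrySchema and the method name build are made up for illustration, and only the name field is declared, as in the snippet above.

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical helper class, not part of the original post.
public class CountrySchema {

    // Builds: countries: struct<country: array<struct<name: string>>>
    public static StructType build() {
        // One element of the "country" array; only "name" is declared here.
        StructType country = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("name", DataTypes.StringType, true)
        });

        // "country" is an array of structs, not a single struct.
        StructType countries = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField(
                "country", DataTypes.createArrayType(country, true), true)
        });

        // Top-level record with the single "countries" field.
        return DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("countries", countries, true)
        });
    }

    public static void main(String[] args) {
        // Prints the schema as a tree, similar to printSchema().
        System.out.println(build().treeString());
    }
}
```

The result of CountrySchema.build() can be passed to spark.read().schema(...) in place of the hand-nested expression.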
With this schema, you get the following countries:
df.createOrReplaceTempView("country_data");
spark.sql("SELECT countries FROM country_data").show();
+--------------------------------------------------------------+
|countries |
+--------------------------------------------------------------+
|[WrappedArray([China], [India], [United States], [Indonesia])]|
+--------------------------------------------------------------+
If you want to list every country in the array, use explode:
spark.sql("SELECT country.name FROM country_data lateral view explode(countries.country) t as country").show(false);
+-------------+
|name |
+-------------+
|China |
|India |
|United States|
|Indonesia |
+-------------+
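It does not show any entries in the output; please check the update in the question.
What is GlobalConstants.ReadyJsonFile? Does your file contain pretty-printed JSON (as in your post) or only a single line of JSON?
The problem was that my JSON was not single-line JSON, which is why it failed. May I know why that is?
Can you check this question? I have almost finished the extraction but am stuck on a nested element.
@JonLe your code looks complicated. Why not use schema discovery? I mean val schema = spark.read.json(Seq(json).toDS.rdd).schema.json and then org.apache.spark.sql.types.DataType.fromJson(schema)?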
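On the single-line question raised in the comments: by default Spark's JSON reader expects one complete JSON document per line, so a pretty-printed file is not parsed as a single record. The multiLine read option asks the reader to treat the whole file as one document. Below is a minimal sketch of that option combined with the explode step, assuming a local SparkSession; the class name MultiLineCountries, the method countryNames and the path argument are hypothetical.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical example class, not part of the original post.
public class MultiLineCountries {

    // countries: struct<country: array<struct<name: string>>>
    static StructType schema() {
        StructType country = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("name", DataTypes.StringType, true)
        });
        StructType countries = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField(
                "country", DataTypes.createArrayType(country, true), true)
        });
        return DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("countries", countries, true)
        });
    }

    // Reads a pretty-printed JSON file and returns one name per country.
    public static List<String> countryNames(SparkSession spark, String path) {
        Dataset<Row> ds = spark.read()
            .schema(schema())
            .option("multiLine", true) // the file holds one JSON document spread over many lines
            .json(path);

        return ds
            .select(explode(col("countries.country")).as("country")) // one row per array element
            .select(col("country.name"))
            .collectAsList()
            .stream()
            .map(row -> row.getString(0))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("multiline-json")
            .master("local[*]") // assumption: local run for illustration
            .getOrCreate();
        countryNames(spark, args[0]).forEach(System.out::println);
        spark.stop();
    }
}
```

The alternative is to leave the reader at its default and store each JSON document on a single line.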