Java: How to infer the schema of a JSON file?

java json apache-spark

I have the following Java string:

{
    "header": {
        "gtfs_realtime_version": "1.0",
        "incrementality": 0,
        "timestamp": 1528460625,
        "user-data": "metra"
    },
    "entity": [{
            "id": "8424",
            "vehicle": {
                "trip": {
                    "trip_id": "UP-N_UN314_V1_D",
                    "route_id": "UP-N",
                    "start_time": "06:17:00",
                    "start_date": "20180608",
                    "schedule_relationship": 0
                },
                "vehicle": {
                    "id": "8424",
                    "label": "314"
                },
                "position": {
                    "latitude": 42.10085,
                    "longitude": -87.72896
                },
                "current_status": 2,
                "timestamp": 1528460601
            }
        }
    ]
}

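This string represents a JSON document. I want to infer its schema into a Spark DataFrame in a streaming application.

How can I split the fields of the string the way I would with a CSV document (where I can call .split(""))?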
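
Quoting the official documentation:

By default, Structured Streaming from file based sources requires you to specify the schema, rather than rely on Spark to infer it automatically. This restriction ensures a consistent schema will be used for the streaming query, even in the case of failures. For ad-hoc use cases, you can reenable schema inference by setting spark.sql.streaming.schemaInference to true.

So you could use the spark.sql.streaming.schemaInference configuration property to enable schema inference. I'm not sure whether that works for JSON files.

What I usually do is load a single file (in a batch query, before starting the streaming query) to infer the schema. That should work in your case. Just do the following: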

// I'm leaving converting Scala to Java as a home exercise
val jsonSchema = spark
  .read
  .option("multiLine", true) // <-- the trick
  .json("sample.json")
  .schema
scala> jsonSchema.printTreeString
root
 |-- entity: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- vehicle: struct (nullable = true)
 |    |    |    |-- current_status: long (nullable = true)
 |    |    |    |-- position: struct (nullable = true)
 |    |    |    |    |-- latitude: double (nullable = true)
 |    |    |    |    |-- longitude: double (nullable = true)
 |    |    |    |-- timestamp: long (nullable = true)
 |    |    |    |-- trip: struct (nullable = true)
 |    |    |    |    |-- route_id: string (nullable = true)
 |    |    |    |    |-- schedule_relationship: long (nullable = true)
 |    |    |    |    |-- start_date: string (nullable = true)
 |    |    |    |    |-- start_time: string (nullable = true)
 |    |    |    |    |-- trip_id: string (nullable = true)
 |    |    |    |-- vehicle: struct (nullable = true)
 |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |-- label: string (nullable = true)
 |-- header: struct (nullable = true)
 |    |-- gtfs_realtime_version: string (nullable = true)
 |    |-- incrementality: long (nullable = true)
 |    |-- timestamp: long (nullable = true)
 |    |-- user-data: string (nullable = true)

The trick is to use the multiLine option, so that the entire file is read as a single row that is then used to infer the schema.
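
To illustrate why the multiLine option matters: by default Spark's JSON reader expects one complete JSON object per line, so a pretty-printed document does not parse line by line. The sketch below is a toy illustration in plain Python, not Spark's implementation. The `parses_line_by_line` and `infer` helpers and the trimmed sample are assumptions made for this example. It parses the whole text as a single record and derives a type tree in the spirit of `printTreeString`.

```python
import json

# Trimmed copy of the sample document, pretty-printed across several lines.
SAMPLE = """{
    "header": {
        "gtfs_realtime_version": "1.0",
        "timestamp": 1528460625
    },
    "entity": [{
        "id": "8424",
        "vehicle": {
            "position": {
                "latitude": 42.10085,
                "longitude": -87.72896
            }
        }
    }]
}"""


def parses_line_by_line(text):
    """Return True only if every non-empty line is a complete JSON value."""
    try:
        for line in text.splitlines():
            if line.strip():
                json.loads(line)
        return True
    except json.JSONDecodeError:
        return False


def infer(value):
    """Map a parsed JSON value to a nested description of its types."""
    if isinstance(value, dict):
        return {key: infer(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        # Toy assumption: the first element is representative of the array.
        return [infer(value[0])] if value else ["unknown"]
    if isinstance(value, bool):  # checked before int, since bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if value is None:
        return "null"
    return "string"


line_mode_ok = parses_line_by_line(SAMPLE)  # the first line is just "{"
schema = infer(json.loads(SAMPLE))          # whole text parsed as one record
print(line_mode_ok)
print(json.dumps(schema, indent=2))
```

The line-by-line attempt fails on the very first line, while parsing the whole text succeeds. That is the situation the multiLine option is meant to handle.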

Use the following (PySpark syntax):

df = spark.read.json(r's3://mypath/', primitivesAsString='true')

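Do you have a case class that represents this JSON? By the way, inference is not possible in Structured Streaming, only in batch.

I don't have such a class, but the structure of the fields is standard... I saw in the Spark programming guide that the schema can be inferred with the following command: Dataset df = sparkSession.readStream().format("kafka").option("kafka.bootstrap.servers", KafkaFeeds.kafkaBrokerEndpoint).option("subscribe", "kafkaToSparkTopic").load()

What output do you expect after "splitting" the sample JSON string?

I just want to extract some fields of the JSON document, for example the "latitude" value of the position field. So I would like a format that is easy to split (like CSV with .split("")).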
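
A nested JSON document is usually addressed by field path rather than split on a delimiter. As a plain-Python illustration of that idea (not Spark code; the `get_path` helper and the trimmed one-line sample are assumptions made for this sketch), the latitude of each entity can be reached like this:

```python
import json

# Trimmed, single-line copy of the sample document.
SAMPLE = (
    '{"entity": [{"id": "8424", "vehicle": {"position": '
    '{"latitude": 42.10085, "longitude": -87.72896}}}]}'
)


def get_path(value, path):
    """Follow dict keys (or list indexes) in order; raises if a step is missing."""
    for step in path:
        value = value[step]
    return value


doc = json.loads(SAMPLE)
# One latitude per element of the "entity" array.
latitudes = [
    get_path(entity, ["vehicle", "position", "latitude"])
    for entity in doc["entity"]
]
print(latitudes)
```

In Spark, once a schema is available, the same path should be selectable as a nested column such as `entity.vehicle.position.latitude`, so no string splitting is needed.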