
Unable to read a JSON file: Spark Structured Streaming with Java


I have a Python script that fetches stock data from the NYSE into a new single-line file every minute (shown below). Each file contains data for four stocks - MSFT, ADBE, GOOGL, and FB - in the following JSON format:

[{"symbol": "MSFT", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "126.0800", "high": "126.1000", "low": "126.0500", "close": "126.0750", "volume": "57081"}}, {"symbol": "ADBE", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "279.2900", "high": "279.3400", "low": "279.2600", "close": "279.3050", "volume": "12711"}}, {"symbol": "GOOGL", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "1166.4100", "high": "1166.7400", "low": "1166.2900", "close": "1166.7400", "volume": "8803"}}, {"symbol": "FB", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "192.4200", "high": "192.5000", "low": "192.3600", "close": "192.4800", "volume": "33490"}}]
I am trying to read this file stream into a Spark streaming DataFrame, but I cannot define a proper schema for it. After searching the internet, this is what I have tried so far:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;



public class Driver1 {

    public static void main(String args[]) throws InterruptedException, StreamingQueryException {


        SparkSession session = SparkSession.builder().appName("Spark_Streaming").master("local[2]").getOrCreate();
        Logger.getLogger("org").setLevel(Level.ERROR);


        StructType priceData = new StructType()
                .add("open", DataTypes.DoubleType)
                .add("high", DataTypes.DoubleType)
                .add("low", DataTypes.DoubleType)
                .add("close", DataTypes.DoubleType)
                .add("volume", DataTypes.LongType);

        StructType schema = new StructType()
                .add("symbol", DataTypes.StringType)
                .add("timestamp", DataTypes.StringType)
                .add("stock", priceData);


        Dataset<Row> rawData = session.readStream().format("json").schema(schema).json("/home/abhinavrawat/streamingData/data/*");
        rawData.printSchema();
        rawData.writeStream().format("console").start().awaitTermination();
        session.close();

    }
}
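
One quick way to see the schema Spark itself would infer (a sketch that is not in the original post; it assumes a few sample files are already sitting in the data directory) is a one-off batch read:

    // Hypothetical schema-discovery step: batch-read the sample files that have
    // already landed and print the schema Spark infers, for comparison.
    Dataset<Row> sample = session.read().json("/home/abhinavrawat/streamingData/data/");
    sample.printSchema();
    // Every value in these files is quoted, so the inferred leaf fields should
    // all come back as strings, and the nested struct should be named "priceData".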
I even tried reading the JSON string in as a text file first and then applying the schema (the way it is done with Kafka streaming):

    Dataset<Row> rawData = session.readStream().format("text").load("/home/abhinavrawat/streamingData/data/*");
    Dataset<Row> raw2 = rawData.select(org.apache.spark.sql.functions.from_json(rawData.col("value"), schema));
    raw2.writeStream().format("console").start().awaitTermination();

That only gives me null values:

    +--------------------+
    |jsontostructs(value)|
    +--------------------+
    |                null|
    |                null|
    |                null|
    |                null|
    |                null|
    +--------------------+

Please help me figure this out.
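
A likely cause of those nulls (an aside, not part of the original question): each line of the file is a top-level JSON array, while the schema handed to from_json describes a single object, so the parse fails. A minimal sketch of a variant that should cope with this, assuming Spark 2.2+, where from_json also accepts a DataType such as an ArrayType (requires import org.apache.spark.sql.types.ArrayType):

    // Wrap the per-record schema in an array type to match the top-level JSON
    // array, then explode the parsed array into one row per stock record.
    ArrayType arraySchema = DataTypes.createArrayType(schema);
    Dataset<Row> lines = session.readStream().format("text").load("/home/abhinavrawat/streamingData/data/*");
    Dataset<Row> parsed = lines
            .select(org.apache.spark.sql.functions.explode(
                    org.apache.spark.sql.functions.from_json(lines.col("value"), arraySchema)).as("record"))
            .select("record.symbol", "record.timestamp", "record.priceData");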

Just figured it out; keep the following two things in mind -

  • When defining the schema, make sure the field names and their order are exactly the same as in the JSON file.

  • Initially, use only StringType for all the fields; you can apply a transformation later to cast them back to specific data types (a sketch of that step follows the output below).

This is what worked for me -

        // All leaf fields start out as StringType; the JSON values are quoted,
        // and the names match the file exactly.
        StructType priceData = new StructType()
                .add("open", DataTypes.StringType)
                .add("high", DataTypes.StringType)
                .add("low", DataTypes.StringType)
                .add("close", DataTypes.StringType)
                .add("volume", DataTypes.StringType);

        // The nested struct is called "priceData", as in the file, not "stock".
        StructType schema = new StructType()
                .add("symbol", DataTypes.StringType)
                .add("timestamp", DataTypes.StringType)
                .add("priceData", priceData);

        Dataset<Row> rawData = session.readStream().format("json").schema(schema).json("/home/abhinavrawat/streamingData/data/*");
        rawData.writeStream().format("console").start().awaitTermination();
        session.close();
    
    Now you can flatten the priceData column using priceData.open, priceData.close, and so on. The console output of the query above looks like this:

      Dataset<Row> rawData = session.readStream().format("text").load("/home/abhinavrawat/streamingData/data/*");
        Dataset<Row> raw2 = rawData.select(org.apache.spark.sql.functions.from_json(rawData.col("value"),schema)); 
    raw2.writeStream().format("console").start().awaitTermination();
    
    +--------------------+
    |jsontostructs(value)|
    +--------------------+
    |                null|
    |                null|
    |                null|
    |                null|
    |                null|
    
        StructType priceData = new StructType()
                .add("open", DataTypes.StringType)
                .add("high", DataTypes.StringType)
                .add("low", DataTypes.StringType)
                .add("close", DataTypes.StringType)
                .add("volume", DataTypes.StringType);
    
        StructType schema = new StructType()
                .add("symbol", DataTypes.StringType)
                .add("timestamp", DataTypes.StringType)
                .add("priceData", priceData);
    
    
        Dataset<Row> rawData = session.readStream().format("json").schema(schema).json("/home/abhinavrawat/streamingData/data/*");
        rawData.writeStream().format("console").start().awaitTermination();
        session.close();
    
    +------+-------------------+--------------------+
    |symbol|          timestamp|           priceData|
    +------+-------------------+--------------------+
    |  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
    |  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
    | GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
    |    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
    |  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
    |  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
    | GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
    |    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
    |  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
    |  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
    | GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
    |    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
    |  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
    |  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
    | GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
    |    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
    |  MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
    |  ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
    | GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
    |    FB|2019-05-02 15:59:00|[192.4200, 192.50...|
    +------+-------------------+--------------------+
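
    Putting the two tips together, here is a minimal sketch (mine, not from the original answer; the Double and Long targets are assumptions based on the data) of the flatten-and-cast step mentioned above, used in place of writing rawData out directly:

        // Flatten the nested priceData struct and cast the string fields back
        // to numeric types; the output column names are illustrative choices.
        Dataset<Row> flattened = rawData.select(
                rawData.col("symbol"),
                rawData.col("timestamp"),
                rawData.col("priceData.open").cast(DataTypes.DoubleType).as("open"),
                rawData.col("priceData.high").cast(DataTypes.DoubleType).as("high"),
                rawData.col("priceData.low").cast(DataTypes.DoubleType).as("low"),
                rawData.col("priceData.close").cast(DataTypes.DoubleType).as("close"),
                rawData.col("priceData.volume").cast(DataTypes.LongType).as("volume"));

        flattened.writeStream().format("console").start().awaitTermination();

    The cast happens after the file has already been parsed with the all-string schema, so the quoted numeric strings never trip the JSON parser.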