
Java Spark Structured Streaming "complete" mode not working as expected


I created a Structured Streaming job as shown below. As a first step, I added just one file to the INPUT_DIRECTORY and it worked fine. My understanding was that if a new, exactly identical file were added to the same directory, the counts would double, since I am using "complete" output mode. But that is not what happens. Is my understanding of "complete" mode wrong? Here is the code:

    Dataset<StoreSales> storeSalesStream = spark
            .readStream()
            .schema(storeSalesSchema)
            .csv(INPUT_DIRECTORY)
            .as(Encoders.bean(StoreSales.class));


    //When new data arrives on the stream, these steps will get executed

    //4 - Create a temporary view so we can use SQL queries
    storeSalesStream.createOrReplaceTempView("storeSales");

    String sql = "SELECT AVG(ss_quantity) AS average_quantity, COUNT(*) AS cnt, ss_store_sk "
            + "FROM storeSales GROUP BY ss_store_sk ORDER BY ss_store_sk";
    Dataset<Row> salesByStore = spark.sql(sql);

    //5 - Write the output of the query to the console
    StreamingQuery query = salesByStore.writeStream()
            .outputMode("complete")
            .format("console")
            .start();

    query.awaitTermination();
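
One way to verify what each micro-batch actually read, independent of the console sink, is Spark's streaming progress API. The following is a minimal debugging sketch, not part of the original post; it assumes the query variable from the snippet above and replaces the blocking awaitTermination() call:

    import org.apache.spark.sql.streaming.StreamingQueryProgress;

    // Poll the query instead of blocking on awaitTermination().
    // Each micro-batch reports how many input rows it read, so a
    // Batch 1 whose input row count matches the copied file would
    // confirm that the file source picked the file up completely.
    while (query.isActive()) {
        StreamingQueryProgress progress = query.lastProgress();
        if (progress != null) {
            System.out.println("batch=" + progress.batchId()
                    + " inputRows=" + progress.numInputRows());
        }
        try {
            Thread.sleep(5000);     // check every 5 seconds
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            break;
        }
    }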
Output after copying the new (exactly identical) file. The console sink printed Batch 0 (one input file) and then Batch 1:

Batch 0:

+------------------+------+-----------+
|  average_quantity|   cnt|ss_store_sk|
+------------------+------+-----------+
| 50.60551037038176|130035|       null|
|50.550020846689414|456896|          1|
|  50.5936442439659|458159|          2|
| 50.43842163027273|458272|          4|
| 50.55502265092984|458194|          7|
| 50.47613176523048|459357|          8|
| 50.47919908681093|459492|         10|
+------------------+------+-----------+

Batch 1:

+------------------+------+-----------+
|  average_quantity|   cnt|ss_store_sk|
+------------------+------+-----------+
| 50.63186245101668|156491|       null|
| 50.54070128518595|549167|          1|
|50.600774270842244|550477|          2|
| 50.46126613833389|550604|          4|
| 50.57143520798298|551066|          7|
| 50.46779475780309|552865|          8|
| 50.46881408984539|552466|         10|
+------------------+------+-----------+

From the comments:

Q: Does this also work as a non-streaming batch query (plain Spark SQL)? Can you show the contents of the file and how you copy the new one? How many files are in your input directory?

A: Yes, it works fine in non-streaming mode. For the test I used the store_sales table from the TPC-DS database (). It has 288804 rows, which I downloaded as a CSV. I started with a single CSV file, which was printed correctly as Batch 0. Later I created another copy of the same file in the same directory with the command cp tpcds_store_sales.csv tpcds_store_sales1.csv. My understanding was that Structured Streaming would detect this new file and, since I am using "complete" mode, the counts in Batch 1 would double, but that does not happen! Do I need to specify something like a "buffer size" or "window size"?
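
For what it's worth, the file source has no documented "buffer size" or "window size" option. The knob it does document for pacing input is maxFilesPerTrigger, and the Structured Streaming programming guide requires that input files be placed into the directory atomically (on most file systems, a move rather than a copy), because a large file created with cp can be listed and read while it is still being written. A sketch of the reader with that option set, reusing the variables from the original snippet:

    Dataset<StoreSales> storeSalesStream = spark
            .readStream()
            .schema(storeSalesSchema)
            // Read at most one new file per micro-batch, so every
            // file dropped into INPUT_DIRECTORY becomes its own
            // batch in the console output.
            .option("maxFilesPerTrigger", 1)
            .csv(INPUT_DIRECTORY)
            .as(Encoders.bean(StoreSales.class));

Note that the Batch 1 counts above grew but did not double, which would be consistent with the copy being picked up before cp had finished writing it; moving the fully written file into the directory with mv avoids that race.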