Apache Spark: "Queries with streaming sources must be executed with writeStream.start()" error in Spark Structured Streaming


I am facing some issues while executing Spark SQL on top of Spark Structured Streaming. The error is attached below.

Here is my code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object sparkSqlIntegration {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("StructuredStreaming")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work around a Windows bug in Spark 2.0.0; omit if you're not on Windows.
      .config("spark.sql.streaming.checkpointLocation", "file:///C:/checkpoint")
      .getOrCreate()

    setupLogging()
    val userSchema = new StructType().add("name", "string").add("age", "integer")
    // Create a stream of text files dumped into the csvFolder directory
    val rawData = spark.readStream.option("sep", ",").schema(userSchema).csv("file:///C:/Users/R/Documents/spark-poc-centri/csvFolder")

    // Must import spark.implicits for conversion to DataSet to work!
    import spark.implicits._
    rawData.createOrReplaceTempView("updates")
    val sqlResult = spark.sql("select * from updates")
    println("sql results here")
    sqlResult.show()
    println("Otheres")
    val query = rawData.writeStream.outputMode("append").format("console").start()

    // Keep going until we're stopped.
    query.awaitTermination()

    spark.stop()
  }
}
During execution I get the error below. Since I am new to streaming, can anyone tell me how to execute Spark SQL queries on Spark Structured Streaming?

2018-12-27 16:02:40 INFO  BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, LAPTOP-5IHPFLOD, 6829, None)
2018-12-27 16:02:41 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6731787b{/metrics/json,null,AVAILABLE,@Spark}
sql results here
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[file:///C:/Users/R/Documents/spark-poc-centri/csvFolder]
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:37)
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
    at scala.collection.immutable.List.foreach(List.scala:392)

You don't need these lines:

import spark.implicits._
rawData.createOrReplaceTempView("updates")
val sqlResult = spark.sql("select * from updates")
println("sql results here")
sqlResult.show()
println("Otheres")

Most importantly, the select * is not needed. When you print the DataFrame you will already see all of its columns, so there is also no need to register a temporary view just to give it a name.

And once you .format("console"), you no longer need .show().
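
Putting that together, a minimal sketch of what the streaming part of the question's main method could look like after removing those lines (reusing the spark session, schema, and CSV path from the question) might be:

// Minimal sketch: the console sink prints every micro-batch, so no .show() is needed.
// Assumes the `spark` session from the question is already created.
val userSchema = new StructType().add("name", "string").add("age", "integer")

val rawData = spark.readStream
  .option("sep", ",")
  .schema(userSchema)
  .csv("file:///C:/Users/R/Documents/spark-poc-centri/csvFolder")

val query = rawData.writeStream
  .outputMode("append")
  .format("console")
  .start()

query.awaitTermination()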

For example, to read from a network socket and output to the console:

val words = // omitted ... some Streaming DataFrame

// Generating a running word count
val wordCounts = words.groupBy("value").count()

// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
Take-away: use DataFrame operations such as .select and .groupBy rather than raw SQL.
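
Applied to the question, the temp-view query could be replaced by operations on the streaming DataFrame itself. This is only a hypothetical sketch using the name/age schema and the rawData stream from the question:

// Hypothetical example: count rows per name directly on the streaming DataFrame,
// instead of registering a temp view and running raw SQL.
val countsByName = rawData
  .select("name", "age")
  .groupBy("name")
  .count()

// Streaming aggregations need "complete" (or "update") output mode.
val query = countsByName.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()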

Alternatively, you can use Spark Streaming (DStreams). There you need to call foreachRDD on every stream batch and convert each RDD into a DataFrame, which you can then query with SQL:

/** Case class for converting an RDD to a DataFrame */
case class Record(word: String)

val words = // omitted ... some DStream

// Convert the RDDs of the words DStream to DataFrames and run a SQL query
words.foreachRDD { (rdd: RDD[String], time: Time) =>
  // Get the singleton instance of SparkSession
  val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
  import spark.implicits._

  // Convert RDD[String] to RDD[case class] to DataFrame
  val wordsDataFrame = rdd.map(w => Record(w)).toDF()

  // Create a temporary view using the DataFrame
  wordsDataFrame.createOrReplaceTempView("words")

  // Do a word count on the table using SQL and print it
  val wordCountsDataFrame =
    spark.sql("select word, count(*) as total from words group by word")
  println(s"========= $time =========")
  wordCountsDataFrame.show()
}

ssc.start()
ssc.awaitTermination()
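
The SparkSessionSingleton used above is not a built-in Spark API; in the official Spark Streaming example it is a small helper object defined roughly like this:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/** Lazily instantiated singleton SparkSession, as in the Spark Streaming docs example. */
object SparkSessionSingleton {

  @transient private var instance: SparkSession = _

  def getInstance(sparkConf: SparkConf): SparkSession = {
    if (instance == null) {
      instance = SparkSession
        .builder
        .config(sparkConf)
        .getOrCreate()
    }
    instance
  }
}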