Why does foreachRDD not populate a DataFrame with new content when using StreamingContext.textFileStream?


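My problem is that when I change my code to streaming mode and build the DataFrame inside the foreachRDD loop, the DataFrame shows up as an empty table. It is never filled. I also cannot pass it to assembler.transform(). The error is: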

Error:(38, 40) not enough arguments for method map: (mapFunc: String => U)(implicit evidence$2: scala.reflect.ClassTag[U])org.apache.spark.streaming.dstream.DStream[U].
Unspecified value parameter mapFunc.
      val dataFrame = Train_DStream.map()

My input file is train.csv. Please help me. Here is my code:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

import scala.util.Try

/**
  * Created by saeedtkh on 5/22/17.
  */
object ML_Test {
  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount")
    val sc = new SparkContext(sparkConf)
    // Create the context
    val ssc = new StreamingContext(sc, Seconds(10))
    val sqlContext = new SQLContext(sc)

    val customSchema = StructType(Array(
      StructField("column0", StringType, true),
      StructField("column1", StringType, true),
      StructField("column2", StringType, true)))

    //val Test_DStream = ssc.textFileStream("/Users/saeedtkh/Desktop/sharedsaeed/train.csv").map(LabeledPoint.parse)
    val Train_DStream = ssc.textFileStream("/Users/saeedtkh/Desktop/sharedsaeed/train.csv")
    val DStream = Train_DStream.map(line => line.split(">")).map(array => {
      val first = Try(array(0).trim.split(" ")(0)) getOrElse ""
      val second = Try(array(1).trim.split(" ")(6)) getOrElse ""
      val third = Try(array(2).trim.split(" ")(0).replace(":", "")) getOrElse ""
      Row.fromSeq(Seq(first, second, third))
    })

    DStream.foreachRDD { Test_DStream =>
      val dataFrame = sqlContext.createDataFrame(Test_DStream, customSchema)
      dataFrame.groupBy("column1", "column2").count().show()

      val numFeatures = 3
      val model = new StreamingLinearRegressionWithSGD()
        .setInitialWeights(Vectors.zeros(numFeatures))

      val featureCol = Array("column1", "column2")
      val assembler = new VectorAssembler().setInputCols(featureCol).setOutputCol("features")
      dataFrame.show()
      val df_new = assembler.transform(dataFrame)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

My guess is that all the files under the /Users/saeedtkh/Desktop/sharedsaeed/train.csv directory have already been processed, so there are no files left and the DataFrame is empty.

Note that the only input parameter of StreamingContext.textFileStream is a directory, not a file.

textFileStream(directory: String): DStream[String] creates an input stream that monitors a Hadoop-compatible filesystem for new files and reads them as text files.
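Because the argument has to be a directory, a small sanity check before starting the stream can catch a file path passed by mistake. The following is only a sketch: the helper name WatchDir.resolve is hypothetical, and it assumes the data lives on the local filesystem (as the /Users/... path in the question suggests), so it would not apply to HDFS or other remote paths.

```scala
import java.nio.file.{Files, Path, Paths}

object WatchDir {
  // Hypothetical helper: returns the directory that should be monitored.
  // If `path` is already a directory it is returned unchanged; if it points
  // at something else (for example a file), its parent directory is returned
  // when that parent exists. Otherwise the result is None.
  def resolve(path: Path): Option[Path] =
    if (Files.isDirectory(path)) Some(path)
    else Option(path.getParent).filter(Files.isDirectory(_))

  def main(args: Array[String]): Unit = {
    // Path taken from the question; it looks like a file, not a directory.
    val candidate = Paths.get("/Users/saeedtkh/Desktop/sharedsaeed/train.csv")
    resolve(candidate) match {
      case Some(dir) => println(s"Directory to pass to textFileStream: $dir")
      case None      => println(s"No usable directory found for: $candidate")
    }
  }
}
```

The resolved directory, converted with toString, is what would then be handed to textFileStream.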

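Also note that once a file has been processed by a Spark Streaming application, it should not be changed (or appended to): the file is already marked as processed, and Spark Streaming will ignore any modifications.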

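Quoting the official Spark Streaming documentation:

Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written in nested directories are not supported). Note that:

The files must have the same data format.
The files must be created in dataDirectory by atomically moving or renaming them into the data directory.
Once moved, the files must not be changed. So if files are continuously appended to, the new data will not be read.

For simple text files, there is an easier method, streamingContext.textFileStream(dataDirectory). File streams do not require running a receiver, so no cores need to be allocated for them.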
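The "atomically moving or renaming" requirement can be met by writing the file somewhere else first and then moving it into the watched directory in one step. Below is a minimal sketch using java.nio; the object name AtomicDrop, the dropFile helper, and the staging/incoming directory names are all hypothetical, and it assumes both directories sit on the same local filesystem (an atomic move across filesystems may fail).

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

object AtomicDrop {
  // Writes `content` to a temporary file in `stagingDir`, then moves it into
  // `watchedDir` with a single atomic rename, so a monitoring process never
  // sees a half-written file. Returns the final path.
  def dropFile(stagingDir: Path, watchedDir: Path, name: String, content: String): Path = {
    val staged = stagingDir.resolve(name + ".tmp")
    Files.write(staged, content.getBytes(StandardCharsets.UTF_8))
    Files.move(staged, watchedDir.resolve(name), StandardCopyOption.ATOMIC_MOVE)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical layout under a temp root; staging is outside the watched directory.
    val root    = Files.createTempDirectory("stream-demo")
    val staging = Files.createDirectory(root.resolve("staging"))
    val watched = Files.createDirectory(root.resolve("incoming"))

    val target = dropFile(staging, watched, "train-0001.csv", "a > b > c\n")
    println(s"Moved into watched directory: $target")
  }
}
```

Each new batch of data would go into a new file name, since a file that has already been picked up is not read again.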

Please also replace setMaster("local") with setMaster("local[*]") to make sure your Spark Streaming application has enough threads to process incoming data (you need at least 2 threads).
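Putting the two suggestions together, a stripped-down version of the streaming setup might look like the sketch below. It assumes that /Users/saeedtkh/Desktop/sharedsaeed is the directory new files get moved into (that is a guess based on the path in the question), and the object name DirectoryStreamSketch is made up for illustration. It needs the Spark Streaming dependency on the classpath and only counts lines; the parsing and DataFrame steps from the question are left out.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectoryStreamSketch {
  def main(args: Array[String]): Unit = {
    // local[*] uses as many worker threads as there are cores on the machine.
    val conf = new SparkConf().setMaster("local[*]").setAppName("DirectoryStreamSketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Assumption: this is the watched directory, not the train.csv file itself.
    val lines = ssc.textFileStream("/Users/saeedtkh/Desktop/sharedsaeed")

    lines.foreachRDD { rdd =>
      // Batches with no new files produce empty RDDs, so skip them.
      if (!rdd.isEmpty()) {
        println(s"New lines in this batch: ${rdd.count()}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this running, only files that appear in the directory after the stream has started should show up in a batch; a train.csv that was already there beforehand would typically not be picked up.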
