
Scala: how to do feature extraction with a DStream in Apache Spark


I have data coming in from Kafka through a DStream, and I want to perform feature extraction on it in order to obtain some keywords.

I do not want to wait for all of the data to arrive (it is a continuous stream that may never end), so I would like to perform the extraction in chunks; it does not matter much to me if this costs some accuracy.
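
For context, a minimal sketch of how such a stream might be created, assuming the spark-streaming-kafka-0-10 integration; the broker address, topic name, the parseData helper and the Data type are placeholders for illustration, not part of my actual setup:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Placeholder Kafka consumer configuration
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "keyword-extraction",
  "auto.offset.reset"  -> "latest"
)

// ssc is an existing StreamingContext; parseData turns a raw message into Data (both placeholders)
val stream: DStream[Data] =
  KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent,
      Subscribe[String, String](Seq("documents"), kafkaParams))
    .map(record => parseData(record.value()))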

So far I have put together something like the following:

def extractKeywords(stream: DStream[Data]): Unit = {

  val spark: SparkSession = SparkSession.builder.getOrCreate

  val streamWithWords: DStream[(Data, Seq[String])] = stream map extractWordsFromData

  val streamWithFeatures: DStream[(Data, Array[String])] = streamWithWords transform extractFeatures(spark) _

  val streamWithKeywords: DStream[DataWithKeywords] = streamWithFeatures map addKeywordsToData

  streamWithFeatures.print()
}

def extractFeatures(spark: SparkSession)
                   (rdd: RDD[(Data, Seq[String])]): RDD[(Data, Array[String])] = {

  val df = spark.createDataFrame(rdd).toDF("data", "words")

  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(numOfFeatures)
  val rawFeatures = hashingTF.transform(df)

  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(rawFeatures)

  val rescaledData = idfModel.transform(rawFeatures)

  import spark.implicits._
  rescaledData.select("data", "features").as[(Data, Array[String])].rdd
}
However, I am getting
java.lang.IllegalStateException: Haven't seen any document yet.
and I am not really surprised, since I was just trying to piece these parts together, and I understand that, because I am not waiting for any data to arrive first, the data may be empty when I try to use the generated model on it.
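
One way to read the exception: extractFeatures is invoked once per micro-batch, and IDF.fit has nothing to work with when a batch is empty. A minimal defensive sketch (illustration only, not the approach I ended up taking) would skip empty RDDs before fitting:

// Illustration: wrap the method above and skip empty micro-batches before fitting IDF
def extractFeaturesSafe(spark: SparkSession)
                       (rdd: RDD[(Data, Seq[String])]): RDD[(Data, Array[String])] =
  if (rdd.isEmpty()) rdd.sparkContext.emptyRDD[(Data, Array[String])]
  else extractFeatures(spark)(rdd)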


What is the correct way to approach this problem?

I followed the suggestion from the comments and split the process into two runs:

  • a program that computes the IDF model and saves it to a file

    def trainFeatures(idfModelFile: File, rdd: RDD[(String, Seq[String])]) = {
      val session: SparkSession = SparkSession.builder.getOrCreate
    
      val wordsDf = session.createDataFrame(rdd).toDF("data", "words")
    
      val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
      val featurizedDf = hashingTF.transform(wordsDf)
    
      val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
      val idfModel = idf.fit(featurizedDf)
    
      idfModel.write.save(idfModelFile.getAbsolutePath)
    }
    
  • a method that reads the IDF model from the file and runs it on all incoming messages (a sketch of how the two runs might be wired together follows after the list)

    val idfModel = IDFModel.load(idfModelFile.getAbsolutePath)
    
    val documentDf = spark.createDataFrame(rdd).toDF("update", "document")
    
    val tokenizer = new Tokenizer().setInputCol("document").setOutputCol("words")
    val wordsDf = tokenizer.transform(documentDf)
    
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
    val featurizedDf = hashingTF.transform(wordsDf)
    
    val extractor = idfModel.setInputCol("rawFeatures").setOutputCol("features")
    val featuresDf = extractor.transform(featurizedDf)
    
    featuresDf.select("update", "features")
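
For completeness, a hedged sketch of how the two runs might be wired together; the corpus path, model path, wrapper names, and the (update, document) string pairs carried by the stream are assumptions for illustration:

import java.io.File
import org.apache.spark.ml.feature.{HashingTF, IDFModel, Tokenizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

// Run 1 (one-off, before the streaming job starts): train the IDF model on a historical corpus
val spark = SparkSession.builder.getOrCreate
val corpus = spark.sparkContext
  .textFile("/data/historical-documents.txt")                        // placeholder path
  .map(line => (line, line.toLowerCase.split("\\s+").toSeq))
trainFeatures(new File("/models/idf"), corpus)                       // the method from the first bullet

// Run 2 (streaming job): load the model once and apply it to every micro-batch
def extractKeywords(stream: DStream[(String, String)]): Unit = {
  val idfModel = IDFModel.load("/models/idf")

  stream foreachRDD { rdd =>
    if (!rdd.isEmpty()) {                                            // skip empty micro-batches
      val documentDf   = spark.createDataFrame(rdd).toDF("update", "document")
      val wordsDf      = new Tokenizer().setInputCol("document").setOutputCol("words").transform(documentDf)
      val featurizedDf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").transform(wordsDf)
      val featuresDf   = idfModel.setInputCol("rawFeatures").setOutputCol("features").transform(featurizedDf)
      featuresDf.select("update", "features").show()
    }
  }
}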