Cannot run LDA on Dataset[(scala.Long, org.apache.spark.mllib.linalg.Vector)] in Spark 2.0

Tags: scala, apache-spark, apache-spark-mllib

I was following a tutorial video on an LDA example and ran into the following problem:

<console>:37: error: overloaded method value run with alternatives:
  (documents: org.apache.spark.api.java.JavaPairRDD[java.lang.Long,org.apache.spark.mllib.linalg.Vector])org.apache.spark.mllib.clustering.LDAModel <and>
  (documents: org.apache.spark.rdd.RDD[(scala.Long, org.apache.spark.mllib.linalg.Vector)])org.apache.spark.mllib.clustering.LDAModel
  cannot be applied to (org.apache.spark.sql.Dataset[(scala.Long, org.apache.spark.mllib.linalg.Vector)])
     val model = run(lda_countVector)
                                   ^

The Spark API changed between the 1.x and 2.x branches. In particular, DataFrame.map returns a Dataset, not an RDD, so the result is not compatible with the old RDD-based MLlib API. You should convert your data to an RDD first, as follows:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.sql.Row

// Toy data: (document id, term-count vector) pairs
val a = Vectors.dense(Array(1.0, 2.0, 3.0))
val b = Vectors.dense(Array(3.0, 4.0, 5.0))
val df = Seq((1L, a), (2L, b), (2L, a)).toDF

// Drop down to an RDD[(Long, Vector)], the input type
// the RDD-based mllib LDA expects
val ldaDF = df.rdd.map {
  case Row(id: Long, countVector: Vector) => (id, countVector)
}

val model = new LDA().setK(3).run(ldaDF)
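
Both snippets here assume a SparkSession with its implicits in scope. In spark-shell that is already the case; in a standalone application you would need a setup along these lines (a minimal sketch; the app name is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("lda-example")   // placeholder name
  .getOrCreate()

// Enables .toDF on local Seqs and the (Long, Vector) encoder used below
import spark.implicits._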
Alternatively, you can convert to a typed Dataset and then to an RDD:

val model = new LDA().setK(3).run(df.as[(Long, Vector)].rdd)
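
Once trained, the topics can be inspected; a small usage sketch against the toy data above (describeTopics is part of the RDD-based LDAModel API):

// For each topic, describeTopics returns the top term indices
// and their weights, here limited to 3 terms per topic
model.describeTopics(3).zipWithIndex.foreach {
  case ((terms, weights), topic) =>
    println(s"Topic $topic: " + terms.zip(weights).mkString(", "))
}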

I am following the same example and getting this error. Any suggestions?

scala> lda_countVector.take(1)
20/06/15 15:44:53 ERROR TaskSetManager: Task 0 in stage 8.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 16, brdn6232.target.com, executor 1): scala.MatchError: [0,6139,[...],[1.0,1.0,1.0,...,2.0,1.0,...]] (of class ...)

The schema is:

root
 |-- id: long (nullable = false)
 |-- features: vector (nullable = true)

That is not the type, that is the schema. Oh, here it is a DataFrame: countVectors: org.apache.spark.sql.DataFrame = [id: bigint, features: vector]

Thanks for looking into this, eliasah, much appreciated! I tried to implement this suggestion and now it gives me another exception. However, when I run your code it works fine, so I will upvote and mark your suggestion as the answer. If I find the problem in my data, I will follow up on this thread.

This looks like a Scala version mismatch. Check whether the cluster is running the same version of Scala as you.

The cluster is on Spark 2.0 with Scala 2.10.

Hi @eliasah, is there a way to convert the CountVectorizer output to dense vectors? I believe LDA throws the exception because CountVectorizer produces sparse vectors, which is why it works fine in your example but not in mine.

Wow, this helped, thank you!
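
On the sparse-vector question above: a minimal sketch of that densification, assuming countVectors is the DataFrame produced by the ml CountVectorizer with schema [id: bigint, features: vector] (the name and schema are taken from the comments). Note that in Spark 2.0 this column holds the new org.apache.spark.ml.linalg.Vector, so matching on the old mllib Vector is itself enough to cause the scala.MatchError:

import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.sql.Row

// countVectors is assumed to come from the ml CountVectorizer.
// Vectors.fromML (available since Spark 2.0) bridges the ml and mllib
// vector types, and .toDense densifies the sparse counts.
val ldaInput = countVectors.rdd.map {
  case Row(id: Long, features: MLVector) =>
    (id, Vectors.fromML(features).toDense: Vector)
}

val model = new LDA().setK(3).run(ldaInput)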