
Scala: MatchError when accessing a vector column in Spark 2.0


I am trying to create an LDA model on a JSON file.

Creating a Spark context with the JSON file:

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder
  .master("local")
  .appName("my-spark-app")
  .config("spark.some.config.option", "config-value")
  .getOrCreate()

val df = sparkSession.read.json("dbfs:/mnt/JSON6/JSON/sampleDoc.txt")
Displaying df should show the DataFrame:

display(df)
Tokenize the text

import org.apache.spark.ml.feature.RegexTokenizer

// Set params for RegexTokenizer
val tokenizer = new RegexTokenizer()
                .setPattern("[\\W_]+")
                .setMinTokenLength(4) // Filter away tokens with length < 4
                .setInputCol("text")
                .setOutputCol("tokens")

// Tokenize document
val tokenized_df = tokenizer.transform(df)
Get the stopwords:

%sh wget http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words -O /tmp/stopwords
Optional: copy the stopwords to the tmp folder

%fs cp file:/tmp/stopwords dbfs:/tmp/stopwords
Collect all of the stopwords:

val stopwords = sc.textFile("/tmp/stopwords").collect()
Filter out the stopwords:

 import org.apache.spark.ml.feature.StopWordsRemover

 // Set params for StopWordsRemover
 val remover = new StopWordsRemover()
                   .setStopWords(stopwords) // This parameter is optional
                   .setInputCol("tokens")
                   .setOutputCol("filtered")

 // Create new DF with Stopwords removed
 val filtered_df = remover.transform(tokenized_df)
Displaying the filtered df should verify that the stopwords got removed:

 display(filtered_df)
Vectorize the frequency of occurrence of words

 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.sql.Row
 import org.apache.spark.ml.feature.CountVectorizer

 // Set params for CountVectorizer
 val vectorizer = new CountVectorizer()
               .setInputCol("filtered")
               .setOutputCol("features")
               .fit(filtered_df)
Verify the vectorizer:

 vectorizer.transform(filtered_df)
           .select("id", "text","features","filtered").show()
After this I see an issue when fitting this vectorizer into LDA. The issue, I believe, is that CountVectorizer gives a sparse vector but LDA requires a dense vector. Still trying to figure out the issue.
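(As a side note: if dense vectors really were required, the vector API exposes toDense; the short sketch below is only an illustration, since the answer further down shows that sparsity is not the actual cause of this error.)

import org.apache.spark.ml.linalg.{Vector, Vectors}

// Hypothetical check only: densify a sparse count vector.
val sparseVec: Vector = Vectors.sparse(5, Seq(0 -> 1.0, 3 -> 2.0))
val denseVec = sparseVec.toDense   // [1.0, 0.0, 0.0, 2.0, 0.0]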

Here is the exception where map is not able to convert it:

import org.apache.spark.mllib.linalg.Vector

// countVectors is the vectorized DataFrame produced by the CountVectorizer above
// (its definition is not shown in the question).
val ldaDF = countVectors.map {
  case Row(id: String, countVector: Vector) => (id, countVector)
}
display(ldaDF)
The exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4083.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4083.0 (TID 15331, 10.209.240.17): scala.MatchError: [0,(1252,[13,17,18,20,30,37,45,50,51,53,63,64,96,101,108,125,174,189,214,221,224,227,238,268,291,309,328,357,362,437,441,455,492,493,511,528,561,613,619,674,764,823,839,980,1098,1143],[1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,2.0,1.0,5.0,1.0,2.0,2.0,1.0,4.0,1.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
Here is a working LDA sample that runs without any issue:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}

val a = Vectors.dense(Array(1.0,2.0,3.0))
val b = Vectors.dense(Array(3.0,4.0,5.0))
val df = Seq((1L,a),(2L,b),(2L,a)).toDF

val ldaDF = df.map { case Row(id: Long, countVector: Vector) => (id, countVector) } 

val model = new LDA().setK(3).run(ldaDF.javaRDD)
display(df)

The only difference is that in the second snippet we have dense vectors.

This has nothing to do with sparsity. Since Spark 2.0.0 the ML Transformers no longer generate o.a.s.mllib.linalg.VectorUDT but o.a.s.ml.linalg.VectorUDT, and they are mapped locally to subclasses of o.a.s.ml.linalg.Vector. These are not compatible with the old MLlib API, which is moving towards deprecation in Spark 2.0.0.

You can convert to the "old" vector type using Vectors.fromML:

import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.ml.linalg.{Vectors => NewVectors}

OldVectors.fromML(NewVectors.dense(1.0, 2.0, 3.0))
OldVectors.fromML(NewVectors.sparse(5, Seq(0 -> 1.0, 2 -> 2.0, 4 -> 3.0)))

But it makes more sense to use the ML implementation of LDA if you are already using the ML transformers.
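
A minimal sketch of that route, reusing the vectorizer and filtered_df from the question above (the parameter values are illustrative, not from the source):

import org.apache.spark.ml.clustering.LDA

// The ml LDA reads the "features" column produced by CountVectorizer directly,
// so no Row pattern matching and no mllib conversion is needed.
val countVectors = vectorizer.transform(filtered_df).select("id", "features")

val ldaModel = new LDA()
  .setK(3)             // illustrative number of topics
  .setMaxIter(10)
  .setFeaturesCol("features")
  .fit(countVectors)

ldaModel.describeTopics(maxTermsPerTopic = 5).show()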

For convenience you can use implicit conversions:

import scala.language.implicitConversions

object VectorConversions {
  import org.apache.spark.mllib.{linalg => mllib}
  import org.apache.spark.ml.{linalg => ml}

  implicit def toNewVector(v: mllib.Vector) = v.asML
  implicit def toOldVector(v: ml.Vector) = mllib.Vectors.fromML(v)
}
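
With those implicits in scope, moving between the two vector types becomes a plain assignment; a minimal usage sketch, assuming the object above has been imported:

import VectorConversions._
import org.apache.spark.ml.{linalg => ml}
import org.apache.spark.mllib.{linalg => mllib}

val newVec: ml.Vector = ml.Vectors.dense(1.0, 2.0, 3.0)
val oldVec: mllib.Vector = newVec     // implicitly converted via toOldVector
val roundTrip: ml.Vector = oldVec     // implicitly converted via toNewVector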

The solution is very simple, guys.. find it below:

//import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.ml.linalg.Vector
I changed:

val ldaDF = countVectors.map { 
             case Row(id: String, countVector: Vector) => (id, countVector) 
            }
to:
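The changed snippet is not reproduced in the source; judging from the import list below, which aliases the ml Vector as MLVector, it presumably becomes roughly:

// Hypothetical reconstruction (the original snippet is missing from the source):
// the pattern now binds countVector against the new ml Vector type (MLVector).
val ldaDF = countVectors.map {
  case Row(id: String, countVector: MLVector) => (id, countVector)
}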

It worked like a charm! It is aligned with what @zero323 has written.

List of imports:

import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.{Row, SparkSession}

Also, the error messages related to this type mismatch are very confusing. For example: Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(VecFunction)' due to data type mismatch: argument 1 requires vector type, however, 'VecFunction' is of vector type. Note how both the argument and the expected input are described as "vector type".

Shouldn't it be org.apache.spark.mllib.linalg.Vectors.fromML? Oh, and by the way, this was very helpful ;)