Spark LDA with Scala and Java
Currently I am trying to implement the LDA algorithm with Apache Spark and Scala, as follows:
// Filter out stopwords
val stopwords: Array[String] = sc.textFile("data/english_stops_words.txt").collect()
val filteredTokens = new StopWordsRemover()
  .setStopWords(stopwords)
  .setCaseSensitive(false)
  .setInputCol("words")
  .setOutputCol("filtered")
  .transform(tokens)
// Limit to top `vocabSize` most common words and convert to word count vector features
val cvModel = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features")
  .setVocabSize(vocabSize)
  .fit(filteredTokens)
val countVectors = cvModel.transform(filteredTokens)
  .select("docId", "features")
  .map { case Row(docId: Long, countVector: Vector) => (docId, countVector) }
  .cache()
After that, I converted this code to the Java API:
// Filter out stopwords
List<String> stopwords = sc.textFile("data/english_stops_words.txt")
    .collect();
DataFrame filteredTokens = new StopWordsRemover()
    .setStopWords(stopwords.toArray(new String[0]))
    .setCaseSensitive(false).setInputCol("words")
    .setOutputCol("filtered").transform(tokens);
// Limit to top `vocabSize` most common words and convert to word count
// vector features
CountVectorizerModel cvModel = new CountVectorizer()
    .setInputCol("filtered").setOutputCol("features")
    .setVocabSize(vocabSize).fit(filteredTokens);
JavaRDD<TextId> countVectors = cvModel.transform(filteredTokens)
    .select("docId", "features").toJavaRDD()
    .map(new Function<Row, TextId>() {
        private static final long serialVersionUID = 1L;

        @Override
        public TextId call(Row row) throws Exception {
            return new TextId(row.get(0).toString(), Long.parseLong(row.get(1).toString()));
        }
    }).cache();
However, the LDA model's run() function only accepts a JavaPairRDD<Long, Vector> parameter, and I ran into trouble converting countVectors to a JavaPairRDD, even though the Scala code can do this.
If you have another solution, please help me.
Thank you very much.
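For context, a minimal sketch of the call the converted code needs to reach, assuming the Spark 1.x mllib API (the class name, k value, and iteration count below are illustrative, not from the original post):

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.clustering.LDAModel;
import org.apache.spark.mllib.linalg.Vector;

public class LdaRunSketch {
    // mllib's LDA.run takes the documents as (docId, term-count vector) pairs,
    // i.e. a JavaPairRDD<Long, Vector>, which is why countVectors has to be
    // converted to exactly that element type first.
    public static LDAModel train(JavaPairRDD<Long, Vector> countVectors, int k) {
        return new LDA()
                .setK(k)               // number of topics (illustrative)
                .setMaxIterations(20)  // illustrative setting
                .run(countVectors);
    }
}
```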
EDIT:
I have changed my code as follows:
JavaPairRDD<Long, Vector> countVectors = cvModel.transform(filteredTokens)
    .select("docId", "features").toJavaRDD()
    .mapToPair(new PairFunction<Row, Long, Vector>() {
        public Tuple2<Long, Vector> call(Row row) throws Exception {
            return new Tuple2<Long, Vector>(Long.parseLong(row.getString(0)), Vectors.dense(row.getDouble(1)));
        }
    }).cache();
Thank you very much, @Till Rohrmann.
But after running the program, I get this exception message:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.Column.as(Ljava/lang/String;Lorg/apache/spark/sql/types/Metadata;)Lorg/apache/spark/sql/Column;
    at org.apache.spark.ml.feature.StopWordsRemover.transform(StopWordsRemover.scala:144)
您能帮我解决这个问题吗?您可以使用
mapToPair
方法创建javapairdd
假设TextId
有一个String
和一个Long
字段,则代码可以如下所示:
JavaPairRDD<Long, Vector> countVectors = cvModel.transform(filteredTokens)
    .select("docId", "features").toJavaRDD()
    .mapToPair(new PairFunction<Row, Long, Vector>() {
        public Tuple2<Long, Vector> call(Row row) throws Exception {
            // In Java the generic type argument of getAs goes before the method name
            return new Tuple2<Long, Vector>(row.<Long>getAs(0), row.<Vector>getAs(1));
        }
    }).cache();
I am very sorry for my mistake, but the LDA model in Spark accepts a JavaPairRDD<Long, Vector> parameter, not a JavaPairRDD<String, Long>. Could you help me solve this problem?

Sure, you only have to parse the Vector of the second column. I will update my solution.

I get an error when I try to return new Tuple2(row.getAs[Long](0), row.getAs[Vector](1)); could you update with another solution? Thank you very much. I changed the code when I edited the question, but now I get an exception from the StopWordsRemover class. The message is: java.lang.NoSuchMethodError: org.apache.spark.sql.Column.as(Ljava/lang/String;Lorg/apache/spark/sql/types/Metadata;)Lorg/apache/spark/sql/Column; Could you help me solve it?

Hi, I am glad I could help you. Here is what I did for LDA:
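A hedged sketch of training an LDA model on the (docId, countVector) pairs and inspecting the resulting topics with the mllib API (the topic count, iteration count, and number of terms shown are assumptions for illustration, not values from this thread):

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.mllib.clustering.DistributedLDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.linalg.Vector;
import scala.Tuple2;

public class LdaTopicsSketch {
    public static void printTopics(JavaPairRDD<Long, Vector> countVectors) {
        // The default EM optimizer returns a DistributedLDAModel
        DistributedLDAModel model = (DistributedLDAModel) new LDA()
                .setK(10)              // illustrative topic count
                .setMaxIterations(30)  // illustrative
                .run(countVectors);
        // describeTopics returns, per topic, term indices and their weights,
        // sorted by decreasing weight
        Tuple2<int[], double[]>[] topics = model.describeTopics(5);
        for (int t = 0; t < topics.length; t++) {
            int[] terms = topics[t]._1();
            double[] weights = topics[t]._2();
            System.out.println("Topic " + t + ": top term index " + terms[0]
                    + " (weight " + weights[0] + ")");
        }
    }
}
```

The term indices refer back to the CountVectorizerModel vocabulary, so cvModel.vocabulary() can map them to the actual words.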