Spark LDA with Scala and Java

Currently, I am trying to implement the LDA algorithm with Apache Spark and Scala, as shown below:

// Filter out stopwords
val stopwords: Array[String] = sc.textFile("data/english_stops_words.txt").collect()
val filteredTokens = new StopWordsRemover()
  .setStopWords(stopwords)
  .setCaseSensitive(false)
  .setInputCol("words")
  .setOutputCol("filtered")
  .transform(tokens)

// Limit to top `vocabSize` most common words and convert to word count vector features
val cvModel = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features")
  .setVocabSize(vocabSize)
  .fit(filteredTokens)
val countVectors = cvModel.transform(filteredTokens)
  .select("docId", "features")
  .map { case Row(docId: Long, countVector: Vector) => (docId, countVector) }
  .cache()
But after that, I converted this code from Scala to the Java API:

        // Filter out stopwords
    List<String> stopwords = sc.textFile("data/english_stops_words.txt")
            .collect();
    DataFrame filteredTokens = new StopWordsRemover()
            .setStopWords(stopwords.toArray(new String[0]))
            .setCaseSensitive(false).setInputCol("words")
            .setOutputCol("filtered").transform(tokens);

    // Limit to top `vocabSize` most common words and convert to word count
    // vector features
    CountVectorizerModel cvModel = new CountVectorizer()
            .setInputCol("filtered").setOutputCol("features")
            .setVocabSize(vocabSize).fit(filteredTokens);

    JavaRDD<TextId> countVectors = cvModel.transform(filteredTokens)
              .select("docId", "features").toJavaRDD()
              .map(new Function<Row, TextId>() {

                private static final long serialVersionUID = 1L;

                @Override
                public TextId call(Row row) throws Exception {

                    return new TextId(row.get(0).toString(), Long.parseLong(row.get(1).toString()));
                }
            }).cache();
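
The TextId class itself is not defined anywhere in this post; the constructor call above only tells us that it takes a String and a long. A minimal, hypothetical sketch consistent with that usage could look like this (the field names are illustrative only):

    // Hypothetical minimal TextId, inferred only from the constructor call above;
    // the real class definition is not part of this post.
    public class TextId implements java.io.Serializable {
        private final String docId;
        private final long count;

        public TextId(String docId, long count) {
            this.docId = docId;
            this.count = count;
        }
    }
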
But the LDA model only accepts a JavaPairRDD<Long, Vector> parameter in its run() function. I ran into trouble when trying to convert countVectors into a JavaPairRDD<Long, Vector>, even though the Scala code can do this directly. If you have another solution, please help me. Thank you very much.
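
For context, countVectors is ultimately meant to be passed to the MLlib LDA, whose run() method is what requires the JavaPairRDD<Long, Vector> type; a minimal sketch of that call follows, where the topic count and iteration number are placeholder values, not taken from the original post:

    // Sketch only: org.apache.spark.mllib.clustering.LDA.run() takes the documents
    // as a JavaPairRDD<Long, Vector>; k and maxIterations below are placeholders.
    LDAModel ldaModel = new LDA()
            .setK(10)
            .setMaxIterations(50)
            .run(countVectors);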

EDIT: I have changed my code as follows:

        JavaPairRDD<Long, Vector> countVectors = cvModel.transform(filteredTokens)
              .select("docId", "features").toJavaRDD()
              .mapToPair(new PairFunction<Row, Long, Vector>() {
                public Tuple2<Long, Vector> call(Row row) throws Exception {
                    return new Tuple2<Long, Vector>(Long.parseLong(row.getString(0)), Vectors.dense(row.getDouble(1)));
                }
            }).cache();
Thank you very much, @Till Rohrmann. But after running the program I got this exception message:

    Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.Column.as(Ljava/lang/String;Lorg/apache/spark/sql/types/Metadata;)Lorg/apache/spark/sql/Column;
        at org.apache.spark.ml.feature.StopWordsRemover.transform(StopWordsRemover.scala:144)


Could you please help me fix this problem?

You can use the mapToPair method to create a JavaPairRDD.

Assuming that TextId has a String and a Long field, the code could look like this:

    JavaPairRDD<Long, Vector> countVectors = cvModel.transform(filteredTokens)
            .select("docId", "features").toJavaRDD()
            .mapToPair(new PairFunction<Row, Long, Vector>() {
                public Tuple2<Long, Vector> call(Row row) throws Exception {
                    return new Tuple2(row.getAs[Long](0), row.getAs[Vector](1));
                }
            }).cache();
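
Note that getAs[Long](0) is Scala syntax and will not compile as Java. A variant of the same mapping written against the Java Row API could look like the sketch below; this is not from the original answer, and it assumes the docId column really holds Long values:

    JavaPairRDD<Long, Vector> countVectors = cvModel.transform(filteredTokens)
            .select("docId", "features").toJavaRDD()
            .mapToPair(new PairFunction<Row, Long, Vector>() {
                public Tuple2<Long, Vector> call(Row row) throws Exception {
                    // Read docId as a Long and the CountVectorizer output as a Vector.
                    return new Tuple2<Long, Vector>(row.getLong(0), (Vector) row.get(1));
                }
            }).cache();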

I am very, very sorry for my mistake, but the LDA model in Spark accepts JavaPairRDD<Long, Vector> parameters, not the JavaPairRDD type from the original answer. Could you please help me fix this?
Sure, you simply have to parse the second column to a Vector. I will update my solution.
I got an error when trying return new Tuple2(row.getAs[Long](0), row.getAs[Vector](1)); could you update the solution again? Thank you very much.
I have changed my code as in the edited question. However, I got an exception when using the StopWordsRemover class. The message is: java.lang.NoSuchMethodError: org.apache.spark.sql.Column.as(Ljava/lang/String;Lorg/apache/spark/sql/types/Metadata;)Lorg/apache/spark/sql/Column; Could you please help me fix this problem?
Hi Christopher, I am glad I could help you. Here is what I did for LDA: