Scala: TF-IDF cosine similarity with Apache Spark

Tags: scala, apache-spark, apache-spark-mllib, tf-idf, cosine-similarity

I am trying to compute a cosine similarity matrix over TF-IDF vectors using Apache Spark. Here is my code:

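import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

def cosSim(input: RDD[Seq[String]]) = {
  // Hash each document's terms into a term-frequency vector.
  val hashingTF = new HashingTF()
  val tf = hashingTF.transform(input)
  tf.cache()
  // Fit IDF weights and rescale the term frequencies.
  val idf = new IDF().fit(tf)
  val tfidf = idf.transform(tf)
  // One row per document, one column per hashed feature.
  val mat = new RowMatrix(tfidf)
  val sim = mat.columnSimilarities
  sim
}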
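My input has about 3,000 rows, but sim.numRows() and sim.numCols() both return 1048576 instead of 3K. As far as I understand, this is because tfidf and mat are both 3K × 1048576, where 1048576 is the number of TF features. Perhaps I need to transpose mat to fix this, but I don't know how.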
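The shape described above follows from columnSimilarities comparing the columns of the matrix, which here are hashed terms rather than documents. The sketch below illustrates this in plain Scala, without Spark, on a made-up 3 × 5 matrix; the names docs, cosine, and columnSims are hypothetical helpers for illustration and are not part of the Spark API.

```scala
// Toy document-term matrix (made-up values): 3 documents (rows) x 5 terms (columns).
val docs: Vector[Vector[Double]] = Vector(
  Vector(1.0, 0.0, 2.0, 0.0, 0.0),
  Vector(0.0, 1.0, 2.0, 0.0, 1.0),
  Vector(1.0, 0.0, 0.0, 3.0, 0.0)
)

// Cosine similarity of two equally sized vectors; returns 0.0 if either is all zeros.
def cosine(a: Vector[Double], b: Vector[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  if (norms == 0.0) 0.0 else dot / norms
}

// Pairwise cosine similarity between the COLUMNS of m.
def columnSims(m: Vector[Vector[Double]]): Vector[Vector[Double]] = {
  val cols = m.transpose
  cols.map(a => cols.map(b => cosine(a, b)))
}

val termSims = columnSims(docs)           // 5 x 5: one entry per pair of terms
val docSims  = columnSims(docs.transpose) // 3 x 3: one entry per pair of documents
```

Comparing columns of the original matrix yields a terms × terms result, while comparing columns of its transpose yields the documents × documents result the question is after.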

You can try:

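import org.apache.spark.mllib.linalg.distributed._

// rowMatrix is the RowMatrix built from the TF-IDF vectors (mat in the question).
// Attach each row's position as an explicit index so it survives the transpose.
val irm = new IndexedRowMatrix(rowMatrix.rows.zipWithIndex.map {
   case (v, i) => IndexedRow(i, v)
})

// After the transpose, documents are columns, so columnSimilarities compares documents.
irm.toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities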
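Putting the question's pipeline and this transpose together, a minimal sketch of the whole function might look like the following. It assumes the RDD-based spark.mllib API used in the question; the function name docCosSim is hypothetical, and the code has not been tuned for large inputs.

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRow, IndexedRowMatrix}
import org.apache.spark.rdd.RDD

// Sketch (hypothetical name): document-to-document cosine similarity on TF-IDF vectors.
def docCosSim(input: RDD[Seq[String]]): CoordinateMatrix = {
  // Term frequencies via feature hashing, then IDF reweighting.
  val tf = new HashingTF().transform(input)
  tf.cache()
  val tfidf = new IDF().fit(tf).transform(tf)

  // Attach each document's position as an explicit row index.
  val indexed = new IndexedRowMatrix(tfidf.zipWithIndex.map {
    case (v, i) => IndexedRow(i, v)
  })

  // Transpose so that documents become columns, then compare columns.
  indexed.toCoordinateMatrix().transpose().toRowMatrix().columnSimilarities()
}
```

With roughly 3,000 input documents the result should be about 3,000 × 3,000 rather than 1048576 × 1048576. As far as I know, columnSimilarities returns only the upper triangle of the similarity matrix, so each pair of documents appears once.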

Could you share a PySpark version of this? I have already built the IDF model and want to compute the cosine similarity.