Scala: using a sparse matrix instead of a dense matrix with LSH
I am trying to apply LSH to compute the cosine similarity of some vectors. In my real data I have 2M rows (documents) and 30K features belonging to them. The matrix is also highly sparse. As an example, suppose my data looks like this:
D1 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D2 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0
D3 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 1 1
D4 ...
In the related code, the features are put into dense vectors, as follows:
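import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.IndexedRow
import org.apache.spark.storage.StorageLevel

val input = "text.txt"
// numPartitions was not defined in the original snippet; 8 is an assumed value
val numPartitions = 8
val conf = new SparkConf()
  .setAppName("LSH-Cosine")
  .setMaster("local[4]")
val storageLevel = StorageLevel.MEMORY_AND_DISK
val sc = new SparkContext(conf)

// read in an example data set of word embeddings:
// each line is an id followed by space-separated feature values
val data = sc.textFile(input, numPartitions).map { line =>
  val split = line.split(" ")
  val word = split.head
  val features = split.tail.map(_.toDouble)
  (word, features)
}

// create a unique id for each word by zipping with the RDD index
val indexed = data.zipWithIndex.persist(storageLevel)

// create indexed row matrix where every row represents one word
val rows = indexed.map {
  case ((word, features), index) =>
    IndexedRow(index, Vectors.dense(features))
}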
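What I want to do is use a sparse matrix instead of a dense one. How can I adapt "Vectors.dense(features)"?

Sparse vectors have an equivalent factory method, which takes an array of indices and a corresponding array of values for the non-zero entries. The method signatures in the cosine-lsh-join-spark library are based on the generic Vector class, so the library appears to accept either sparse or dense vectors.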
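As a minimal sketch of that adjustment, assuming the Spark MLlib factory `Vectors.sparse(size, indices, values)` is the method meant here, a parsed row can be converted by keeping only its non-zero entries. The `SparseRows` object and its `toSparse` and `toIndexedRow` helpers are hypothetical names introduced for illustration; they are not part of Spark or of cosine-lsh-join-spark.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.IndexedRow

// Hypothetical helper object, not part of any library.
object SparseRows {
  // Build a sparse vector from one parsed row by dropping the zeros.
  def toSparse(features: Array[Double]): Vector = {
    // Pair every value with its position and keep the non-zero ones.
    val nonZero = features.zipWithIndex.filter { case (value, _) => value != 0.0 }
    val indices = nonZero.map(_._2) // ascending, because positions are visited in order
    val values = nonZero.map(_._1)
    // The first argument is the full vector length, not the number of non-zeros.
    Vectors.sparse(features.length, indices, values)
  }

  // Drop-in replacement for IndexedRow(index, Vectors.dense(features)).
  def toIndexedRow(index: Long, features: Array[Double]): IndexedRow =
    IndexedRow(index, toSparse(features))
}
```

In the code above, the last step would then read `case ((word, features), index) => SparseRows.toIndexedRow(index, features)`. For the row D1 in the example data, the non-zero indices are 0, 1, 3, 5 and 6. Note that this still parses every zero from the text file; if the input itself can be stored as index/value pairs, the dense intermediate array can be avoided entirely.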