Getting the word for a TF-IDF index using Scala
The words and their indices are not in any obvious order. For example, in document 0 the word "to" maps to index 388, but I don't know which word index 333 corresponds to. How can I get the word back from a rawFeatures index? With CountVectorizer I can use countVectorizerModel.vocabulary.
import org.apache.spark.ml.feature.{HashingTF, IDF}
import spark.implicits._
val df = spark.sparkContext.parallelize(Array(
(0, "to to Scala for better integration with Spark, and easier collaboration other".split(" ")),
(1, "For example in the case when the document is mostly about".split(" ")),
(2, "you need to to put some import declarations and create some data".split(" "))
)).toDF("id", "content")
val hashingTF = new HashingTF().setInputCol("content").setOutputCol("rawFeatures").setNumFeatures(2000)
val featurizedData = hashingTF.transform(df)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaleData = idfModel.transform(featurizedData)
rescaleData.show(false)
+---+------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |content |rawFeatures |features |
+---+------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0 |[to, to, Scala, for, better, integration, with, Spark,, and, easier, collaboration, other]|(2000,[333,388,460,674,935,941,1036,1474,1534,1650,1988],[1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|(2000,[333,388,460,674,935,941,1036,1474,1534,1650,1988],[0.28768207245178085,0.5753641449035617,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])|
|1 |[For, example, in, the, case, when, the, document, is, mostly, about] |(2000,[342,956,1076,1243,1281,1445,1710,1760,1777,1820],[1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0]) |(2000,[342,956,1076,1243,1281,1445,1710,1760,1777,1820],[0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,1.3862943611198906,0.6931471805599453,0.6931471805599453,0.6931471805599453]) |
|2 |[you, need, to, to, put, some, import, declarations, and, create, some, data] |(2000,[265,333,345,388,401,418,537,1400,1425,1695],[1.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0]) |(2000,[265,333,345,388,401,418,537,1400,1425,1695],[0.6931471805599453,0.28768207245178085,0.6931471805599453,0.5753641449035617,0.6931471805599453,0.6931471805599453,0.6931471805599453,1.3862943611198906,0.6931471805599453,0.6931471805599453]) |
+---+------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
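Unlike CountVectorizer, HashingTF keeps no vocabulary: each term is hashed straight to an index, so the mapping is one-way and cannot be looked up directly. One workaround is to rebuild the mapping yourself: hash every distinct word in your corpus with the same hash function and collect an index-to-words table (several words can collide on one index). The sketch below illustrates the idea in plain Scala, using the standard library's MurmurHash3 as a stand-in hash; this is an assumption for illustration only, since Spark's HashingTF uses its own Murmur3 variant (seed 42) and will generally produce different indices. To get Spark's actual indices, compute them through the model itself, e.g. via the indexOf method on ml.feature.HashingTF (available in Spark 3.x).

```scala
import scala.util.hashing.MurmurHash3

object InvertHashingTF {
  val numFeatures = 2000

  // Stand-in for HashingTF's term hash. Spark hashes with its own
  // Murmur3 implementation (seed 42); the stdlib call here is an
  // assumption for illustration and will usually give different
  // indices than Spark's rawFeatures column.
  def termIndex(term: String): Int = {
    val h = MurmurHash3.stringHash(term, 42)
    ((h % numFeatures) + numFeatures) % numFeatures // non-negative modulo
  }

  // Build an index -> words table from the corpus vocabulary.
  // Several words can collide on one index, hence Set[String].
  def indexToWords(corpus: Seq[Seq[String]]): Map[Int, Set[String]] =
    corpus.flatten.distinct
      .groupBy(termIndex)
      .map { case (i, ws) => i -> ws.toSet }

  def main(args: Array[String]): Unit = {
    val corpus = Seq(
      "to to Scala for better integration with Spark, and easier collaboration other".split(" ").toSeq,
      "For example in the case when the document is mostly about".split(" ").toSeq,
      "you need to to put some import declarations and create some data".split(" ").toSeq
    )
    val lookup = indexToWords(corpus)
    // The set stored at the index of "to" necessarily includes "to"
    // (plus any colliding words).
    println(lookup(termIndex("to")))
  }
}
```

The same pattern applied with Spark's own hash (one single-word document per distinct term, transformed through the fitted HashingTF, or indexOf on Spark 3.x) yields a lookup table that matches the indices in the rawFeatures column; collisions mean an index may map to more than one word, which is the price of the hashing trick.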