Spark ALS with string labels: converting indices back to strings

string apache-spark

I have the following code:

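import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.recommendation.ALS

// Fit the indexer to obtain the model (the original snippet referenced userIndexerModel without defining it).
val userIndexerModel = new StringIndexer()
      .setInputCol("userKey")
      .setOutputCol("user")
      .fit(ratings)
val alsRatings = userIndexerModel.transform(ratings)
// Note: trainImplicit expects an RDD[Rating]; the Row-to-Rating mapping is not shown here.
val matrixFactorizationModel = ALS.trainImplicit(alsRatings.rdd, rank = 10, iterations = 10)
val rec = matrixFactorizationModel.recommendProductsForUsers(20)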

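This returns recommendations keyed by the integer user IDs. I want to get my original user key strings back. What is the most efficient way to do that? Thanks.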

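PS: I really cannot understand why the ALS library developers do not accept string labels. Handling the conversion externally (from string to int, and then from int back to string) is painful and expensive. I hope there is an issue for this in their backlog.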

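I usually fit a StringIndexer, collect its labels on the driver, and parallelize the labels together with their indices. Instead of calling transform on the fitted StringIndexer model, I join the DataFrames, which gives the same result as StringIndexer: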

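import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.recommendation.ALS
import spark.implicits._ // needed for toDF

// Fit the indexer only to learn the label -> index mapping.
val swidConverter = new StringIndexer()
  .setInputCol("id")
  .setOutputCol("idIndex")
  .fit(df)

// labels(i) is the string assigned index i, so zipWithIndex rebuilds the mapping as a DataFrame.
val idDf = spark.sparkContext.parallelize(
            swidConverter.labels.zipWithIndex
        ).toDF("id", "idIndex").repartition(PARTITION_SIZE) // set the number of partitions depending on your data size.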

// Joining the idDf(DataFrame) with the actual Data.
val indexedDF = df.join(idDf,idDf.col("id")===df.col("id")).select("idIndex","product_id","rating")

val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("idIndex")
  .setItemCol("product_id")
  .setRatingCol("rating")

val model = als.fit(indexedDF)
val resultRaw = model.recommendForAllUsers(4)

// Joining the idDf(DataFrame) with the Result to get the original ID from the indexed Id.
val resultDf = resultRaw.join(idDf,resultRaw.col("idIndex")===idDf.col("idIndex")).select("id","recommendations")
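
The question uses the RDD-based ALS.trainImplicit API rather than the DataFrame one. A minimal sketch of the same idea for that API follows. It assumes the label array is small enough to broadcast, that the fitted model exposes its labels through labels (some Spark versions prefer labelsArray), and it uses hypothetical toy data and column names (product, rating) that are not from the original post.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("als-labels").getOrCreate()
import spark.implicits._

// Hypothetical toy data: (userKey, product, rating).
val ratings = Seq(
  ("alice", 1, 1.0), ("alice", 2, 1.0),
  ("bob", 2, 1.0), ("bob", 3, 1.0),
  ("carol", 1, 1.0), ("carol", 3, 1.0)
).toDF("userKey", "product", "rating")

val userIndexerModel = new StringIndexer()
  .setInputCol("userKey")
  .setOutputCol("user")
  .fit(ratings)

// StringIndexer emits a Double index, while mllib's Rating needs an Int user ID.
val alsRatings = userIndexerModel.transform(ratings)
  .select($"user".cast("int"), $"product", $"rating")
  .as[(Int, Int, Double)] // tuple columns are mapped by position
  .rdd
  .map { case (u, p, r) => Rating(u, p, r) }

val model = ALS.trainImplicit(alsRatings, rank = 2, iterations = 5)
val rec = model.recommendProductsForUsers(2) // RDD[(Int, Array[Rating])]

// labels(i) is the original string for index i, so a broadcast array lookup
// replaces the join.
val labels = spark.sparkContext.broadcast(userIndexerModel.labels)
val recByKey = rec.map { case (userIdx, recs) => (labels.value(userIdx), recs) }

val collected = recByKey.collect()
collected.foreach { case (key, recs) =>
  println(s"$key -> ${recs.map(_.product).mkString(", ")}")
}
spark.stop()
```

The broadcast lookup avoids a shuffle, but it keeps the whole label array in memory on every executor; with a very large number of distinct users, the join shown above is likely the safer choice.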

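For example, the same API in Python: IndexToString does not work when you have a different DataFrame, because it relies on the metadata of the same DataFrame to which StringIndexer was applied.

It works fine if used correctly :) Check setLabels, for example.

Yes, but setLabels means collecting the labels on a single node, because it works with an array and not with an RDD or Dataset. If the label array is very large, this may not scale :/

You do know that StringIndexer already stores all the labels in driver memory, right?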
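
To make the setLabels suggestion from the comments concrete, here is a minimal sketch. It assumes the labels are taken from the fitted StringIndexerModel via labels, and it uses hypothetical toy data; the detached DataFrame stands in for output such as ALS recommendations, which carries plain integer indices and none of the column metadata that IndexToString would otherwise read.

```scala
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("index-to-string").getOrCreate()
import spark.implicits._

// Hypothetical input: "u-a" is the most frequent label, so it receives index 0.
val df = Seq("u-a", "u-b", "u-a", "u-c").toDF("id")
val indexerModel = new StringIndexer()
  .setInputCol("id")
  .setOutputCol("idIndex")
  .fit(df)

// A DataFrame built elsewhere: numeric indices only, no StringIndexer metadata.
val detached = Seq(0, 1, 2).toDF("idIndex")

// Passing the labels explicitly removes the dependency on column metadata.
val converter = new IndexToString()
  .setInputCol("idIndex")
  .setOutputCol("id")
  .setLabels(indexerModel.labels)

val restored = converter.transform(detached)
restored.show()
```

As the last comment points out, the fitted StringIndexerModel already holds the full label array on the driver, so passing that array to setLabels does not add a new collection step.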