Performance: tuning hyperparameters for an implicit recommender in PySpark


performance apache-spark pyspark


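The computeMAPK function takes the model, the actual data, and the validation data (user, product) and generates ratings. It then sorts the predicted ratings for each user and compares the top K against the actual data to compute the mean average precision at K (MAP@K).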
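As a point of reference, the metric described above can be sketched locally in plain Python, which is handy for checking a distributed version on a small sample. This is a minimal sketch, not the Spark code from this post: the names `average_precision_at_k` and `map_at_k` and the toy data are assumptions made for illustration, and users are only scored when they appear in both inputs, mirroring an inner join.

```python
from collections import defaultdict


def average_precision_at_k(ranked, relevant, k):
    """AP@K for one user: ranked is an ordered list, relevant a set of items."""
    if not relevant:
        return 1.0  # same convention as the apk function in this post
    hits, score, seen = 0, 0.0, set()
    for i, item in enumerate(ranked[:k]):
        # Count each relevant item once, at its first position.
        if item in relevant and item not in seen:
            hits += 1
            score += hits / (i + 1.0)
        seen.add(item)
    return score / min(len(relevant), k)


def map_at_k(predictions, actual, k):
    """predictions: (user, product, score) tuples; actual: (user, product) tuples."""
    scored = defaultdict(list)
    for user, product, score in predictions:
        scored[user].append((product, score))
    relevant = defaultdict(set)
    for user, product in actual:
        relevant[user].add(product)
    # Only users present on both sides are scored (inner-join behaviour).
    users = [u for u in scored if u in relevant]
    if not users:
        return 0.0  # assumption: an empty evaluation set scores 0.0
    total = 0.0
    for u in users:
        by_score = sorted(scored[u], key=lambda ps: ps[1], reverse=True)
        ranked = [p for p, _ in by_score]
        total += average_precision_at_k(ranked, relevant[u], k)
    return total / len(users)


# Hypothetical toy data: two users, three products.
predictions = [(1, "a", 0.9), (1, "b", 0.8), (1, "c", 0.7),
               (2, "a", 0.2), (2, "b", 0.6)]
actual = [(1, "a"), (1, "c"), (2, "a")]
print(map_at_k(predictions, actual, k=2))
```

With k=2, user 1 hits "a" at rank 1 and misses "c", and user 2 hits "a" at rank 2, so both users score 0.5.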

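I am using this function to tune hyperparameters: I fit multiple models and pick the rank, lambda, and alpha with the highest MAPK. This works for small datasets, but it breaks down when the matrix grows to 10 million users × 200 products, particularly at the reduceByKey steps and the join. Is there a better way to tune hyperparameters for implicit ALS? I am using Spark 1.3.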
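The selection loop described above can be sketched as a plain grid search. This is only an illustration of the procedure, not the code I run on the cluster: `pick_best` and `toy_score` are hypothetical names, and `toy_score` is a stand-in so the sketch runs without Spark. In the real setup the scoring function would train an implicit ALS model with the given settings and return its MAP@K on the validation data.

```python
from itertools import product


def pick_best(score_fn, ranks, lambdas, alphas):
    """Try every (rank, lambda, alpha) combination and keep the best score."""
    best_score, best_params = float("-inf"), None
    for rank, lam, alpha in product(ranks, lambdas, alphas):
        score = score_fn(rank, lam, alpha)
        if score > best_score:
            best_score, best_params = score, (rank, lam, alpha)
    return best_score, best_params


def toy_score(rank, lam, alpha):
    # Hypothetical stand-in for "train a model, return MAP@K".
    # It simply peaks at rank=20, lambda=0.1, alpha=40.0.
    return -abs(rank - 20) / 100.0 - abs(lam - 0.1) - abs(alpha - 40.0) / 1000.0


best_score, best_params = pick_best(
    toy_score, ranks=[10, 20, 50], lambdas=[0.01, 0.1, 1.0], alphas=[1.0, 40.0]
)
print(best_params)
```

Each grid point costs one full model fit plus one evaluation pass, so the cost of the evaluation step is multiplied by the size of the grid.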

The actual RDD has the form (user, product). The valid RDD has the form (user, product).

You have some indentation issues; please fix them.

Hi Tom, I have reformatted and updated the code.

Ash, have you made any progress on this?

Yes, I increased the cluster size and moved to Scala and Spark 2.0. I am seeing very good performance with DataFrames, and I used Spark's MAP implementation.

The reason I did not want to use the MAP function that Spark provides is that you cannot pass it a parameter "k". However, I think I could use average precision at k and then take the mean of those results...

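def apk(act_pred):
    """Average precision at k for one user; act_pred is (predicted, actual, k)."""
    predicted, actual, k = act_pred

    # A user with no actual items is scored as 1.0.
    if not actual:
        return 1.0

    # list() keeps this working when predicted is an iterator (Python 3 map).
    predicted = list(predicted)[:k]
    score = 0.0
    num_hits = 0.0

    for i, p in enumerate(predicted):
        # Count each relevant item once, at its first position in the ranking.
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)

    return score / min(len(actual), k)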



from operator import add


def computeMAPKR(model, actual, valid, k):
    # Predict a rating for every (user, product) pair in valid, keyed by user.
    pred = model.predictAll(valid).map(lambda x: (x[0], [(x[1], x[2])]))
    # Concatenate each user's (product, rating) pairs into one list.
    gp = pred.reduceByKey(lambda x, y: x + y)

    # For every user, sort the items by predicted rating and keep the products.
    def f(pairs):
        s = sorted(pairs, key=lambda pr: pr[1], reverse=True)
        return [pr[0] for pr in s]

    sp = gp.mapValues(f)

    # Actual data: one list of products per user.
    ac = actual.map(lambda x: (x[0], [x[1]]))
    gac = ac.reduceByKey(lambda x, y: x + y)

    # (user, (ranked predictions, actual products)); cached because it is
    # used twice below, once for the scores and once for the count.
    ap = sp.join(gac).cache()

    apk_result = ap.mapValues(lambda v: apk((v[0], v[1], k)))
    mapk = apk_result.map(lambda x: x[1]).reduce(add) / ap.count()

    return mapk