How to extract model hyper-parameters from spark.ml in PySpark?


I'm fiddling with some cross-validation code from the PySpark documentation, and trying to get PySpark to tell me what model was selected:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
Running this in the PySpark shell, I can get the logistic regression model's coefficients, but I can't seem to find the value of lr.regParam chosen by the cross-validation procedure. Any ideas?

In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []

I ran my head into this wall too. Unfortunately, you can only get specific parameters for specific models. Happily, for logistic regression you can access the intercept and weights; sadly, you cannot retrieve the regParam. This can be done in the following way:

best_lr = cvModel.bestModel

# get weights
best_lr.weights
>>>DenseVector([3.1573])

# or, better
best_lr.coefficients
>>>DenseVector([3.1573])

# get intercept
best_lr.intercept
>>>-1.0829958115287153
As I wrote before, every model has a few parameters that can be extracted this way. In general, getting the relevant model out of a Pipeline (e.g. cvModel.bestModel when the cross-validator ran over a Pipeline) can be done like this:

best_pipeline = cvModel.bestModel
best_pipeline.stages
>>>[Tokenizer_4bc8884ad68b4297fd3c, CountVectorizer_411fbdeb4100c2bfe8ef, PCA_4c538d67e7b8f29ff8d0, LogisticRegression_4db49954edc7033edc76]
Each model is then obtained by simple list indexing:

best_lr = best_pipeline.stages[3]

Now the methods above can be applied.

Ran into this problem as well. I found out that, for some reason I don't know, you need to call the Java property. So just do this:

from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder, CrossValidator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(metricName="mae")
lr = LinearRegression()
grid = ParamGridBuilder().addGrid(lr.maxIter, [500]) \
                                .addGrid(lr.regParam, [0]) \
                                .addGrid(lr.elasticNetParam, [1]) \
                                .build()
lr_cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, \
                        evaluator=evaluator, numFolds=3)
lrModel = lr_cv.fit(your_training_set_here)
bestModel = lrModel.bestModel
Then print out the parameters you want:

>>> print('Best Param (regParam): ', bestModel._java_obj.getRegParam())
Best Param (regParam):  0
>>> print('Best Param (MaxIter): ', bestModel._java_obj.getMaxIter())
Best Param (MaxIter):  500
>>> print('Best Param (elasticNetParam): ', bestModel._java_obj.getElasticNetParam())
Best Param (elasticNetParam):  1

This works for other methods like extractParamMap() as well. They should fix this soon.
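
For example, the same trick applied to extractParamMap() (a sketch of my own, reusing bestModel from the snippet above):

# Assumption: the Java-side extractParamMap() returns the populated map
# even on Spark versions where the Python-side call comes back empty.
print(bestModel._java_obj.extractParamMap())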

Assuming cvModel3Day is your model name, the params can be extracted in Spark Scala as shown below:

val params = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].extractParamMap()

val depth = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].getMaxDepth

val iter = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].getMaxIter

val bins = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].getMaxBins

val features  = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].getFeaturesCol

val step = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].getStepSize

val samplingRate  = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].getSubsamplingRate
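
For reference, a rough PySpark equivalent of the Scala snippet above (a sketch of my own; like the Scala, it assumes the third pipeline stage holds the fitted GBTClassificationModel):

# Sketch: cvModel3Day is a fitted CrossValidatorModel over a Pipeline
# whose third stage is the GBT model; the getters are called on the
# underlying Java object, as in the answers above.
gbt = cvModel3Day.bestModel.stages[2]
print(gbt._java_obj.getMaxDepth())
print(gbt._java_obj.getMaxIter())
print(gbt._java_obj.getMaxBins())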

There are actually two questions here:

  • what are the aspects of the fitted model (like coefficients and intercepts)
  • what were the meta parameters used to fit the bestModel

Unfortunately, the Python API of the fitted estimators (the models) doesn't allow (easy) direct access to the parameters of the estimator, which makes it hard to answer the latter question.

However, there is a workaround using the Java API. For completeness, first a full setup of a cross-validated model:

%pyspark
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
logit = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[logit])
paramGrid = ParamGridBuilder() \
    .addGrid(logit.regParam, [0, 0.01, 0.05, 0.1, 0.5, 1]) \
    .addGrid(logit.elasticNetParam, [0.0, 0.1, 0.5, 0.8, 1]) \
    .build()
evaluator = BinaryClassificationEvaluator(metricName = 'areaUnderPR')
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=5)
tuned_model = crossval.fit(train)
model = tuned_model.bestModel
One could then use the generic methods on the Java object to get the parameter values, without explicitly referring to methods like getRegParam():

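# A sketch of that generic lookup, reconstructed from the steps listed
# below: look up each grid key on the fitted Java model and read its value.
java_model = model.stages[-1]._java_obj
{param.name: java_model.getOrDefault(java_model.getParam(param.name))
 for param in paramGrid[0]}
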
This executes the following steps:

  • get the fit produced by the estimator from the last stage of the best model: crossval.fit(..).bestModel.stages[-1]
  • get the internal Java object from _java_obj
  • get all configured names from the paramGrid (a list of dictionaries). Only the first row is used, assuming it is an actual grid, i.e. each row contains the same keys; otherwise you need to collect all names ever used in any row
  • get the corresponding Param<T> identifier from the Java object
  • pass the Param<T> instance to getOrDefault() to get the actual value

This took me a few minutes to decipher, but I figured it out.

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# prenotation: I've built out my model already and I am calling the validator ParamGridBuilder
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [1000]) \
    .addGrid(linearSVC.regParam, [0.1, 0.01]) \
    .addGrid(linearSVC.maxIter, [10, 20, 30]) \
    .build()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2)

cvModel = crossval.fit(train)

prediction = cvModel.transform(test)

bestModel = cvModel.bestModel

# applicable to your model to pull list of all stages
for x in range(len(bestModel.stages)):
    print(bestModel.stages[x])

# get stage feature by calling correct Transformer then .get<parameter>()
print(bestModel.stages[3].getNumFeatures())
    
This may not be as good as wernerchao's answer (because it's not convenient to store hyperparameters in variables), but you can quickly look at the best hyperparameters of a cross-validation model this way:

import numpy as np

cvModel.getEstimatorParamMaps()[np.argmax(cvModel.avgMetrics)]
    
(2020-05-21)

I know this is an old question, but I found a way to do it.
@Pierre Gourseaud gave us a nice way to fetch the hyperparameters of the best model:

hyperparams = model_cv.getEstimatorParamMaps()[np.argmax(model_cv.avgMetrics)]
print(hyperparams)
[(Param(parent='ALS_cd65d45ab31c', name='implicitPrefs', doc='whether to use implicit preference'),
  True),
 (Param(parent='ALS_cd65d45ab31c', name='nonnegative', doc='whether to use nonnegative constraint for least squares'),
  True),
 (Param(parent='ALS_cd65d45ab31c', name='coldStartStrategy', doc="strategy for dealing with unknown or new users/items at prediction time. This may be useful in cross-validation or production scenarios, for handling user/item ids the model has not seen in the training data. Supported values: 'nan', 'drop'."),
  'drop'),
 (Param(parent='ALS_cd65d45ab31c', name='rank', doc='rank of the factorization'),
  28),
 (Param(parent='ALS_cd65d45ab31c', name='maxIter', doc='max number of iterations (>= 0).'),
  20),
 (Param(parent='ALS_cd65d45ab31c', name='regParam', doc='regularization parameter (>= 0).'),
  0.01),
 (Param(parent='ALS_cd65d45ab31c', name='alpha', doc='alpha for implicit preference'),
  20.0)]
    
    
But this isn't pretty to look at, so you can do this instead:

import re

hyper_list = []

for i in range(len(hyperparams.items())):
    hyper_name = re.search("name='(.+?)'", str([x for x in hyperparams.items()][i])).group(1)
    hyper_value = [x for x in hyperparams.items()][i][1]

    hyper_list.append({hyper_name: hyper_value})

print(hyper_list)
[{'implicitPrefs': True}, {'nonnegative': True}, {'coldStartStrategy': 'drop'}, {'rank': 28}, {'maxIter': 20}, {'regParam': 0.01}, {'alpha': 20.0}]
    

In my case I trained an ALS model, but it should work in your case too, since I also trained with cross-validation.
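
As a side note (a sketch of my own, not from the answer above): each Param object already exposes a .name attribute, so a dict comprehension avoids the regex entirely:

# Same result as the regex loop above, using Param.name directly.
hyper_dict = {param.name: value for param, value in hyperparams.items()}
print(hyper_dict)
# {'implicitPrefs': True, 'nonnegative': True, 'coldStartStrategy': 'drop',
#  'rank': 28, 'maxIter': 20, 'regParam': 0.01, 'alpha': 20.0}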

Comments from the thread:

  • Related question for the Spark Scala API; pyspark answers below. Make sure to mark an answer (wernerchao's below worked for me).
  • I'll take your word for it, although this project is a distant memory for me now… Nice catch. An even better feature than a fix would be cvModel.getAllTheBestModelsParametersPlease().
  • The answer didn't work for me. The correct way is modelOnly.bestModel.stages[-1]._java_obj.parent().getRegParam(). Or, if you're not using a Pipeline, just drop stages[-1].
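
Spelled out as a runnable sketch (my own; modelOnly is assumed to be a fitted CrossValidatorModel over a Pipeline whose last stage came from the tuned estimator):

# parent() returns the estimator that produced this fitted model,
# and the chosen hyperparameter values live on that estimator.
best_stage = modelOnly.bestModel.stages[-1]
print(best_stage._java_obj.parent().getRegParam())
# Without a Pipeline, drop the stages[-1] step:
# print(modelOnly.bestModel._java_obj.parent().getRegParam())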