How do I extract model hyperparameters from spark.ml in PySpark?
I'm tinkering with some cross-validation code from the PySpark documentation, and trying to get PySpark to tell me which model was selected:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
(Vectors.dense([0.4]), 1.0),
(Vectors.dense([0.5]), 0.0),
(Vectors.dense([0.6]), 1.0),
(Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
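Running this in the PySpark shell, I can get the coefficients of the logistic regression model, but I can't seem to find the value of lr.regParam selected by the cross-validation procedure. Any ideas?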
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])
In [4]: cvModel.bestModel.explainParams()
Out[4]: ''
In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}
In [15]: cvModel.params
Out[15]: []
In [36]: cvModel.bestModel.params
Out[36]: []
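I banged my head against this wall too. Unfortunately, you can only get specific parameters for specific models. For logistic regression you can access the intercept and the weights, but you cannot retrieve regParam. This can be done as follows: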
# bestModel lives on the fitted CrossValidatorModel (cvModel), not on the CrossValidator
best_lr = cvModel.bestModel
# get weights
best_lr.weights
>>>DenseVector([3.1573])
# or better
best_lr.coefficients
>>>DenseVector([3.1573])
# get intercept
best_lr.intercept
>>>-1.0829958115287153
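As I wrote above, each model exposes only a few parameters that can be extracted.
More generally, when the cross-validator runs over a pipeline, the best model is itself a pipeline model, and the relevant stage can be pulled out of it like this:
best_pipeline = cvModel.bestModel
best_pipeline.stages
>>>[Tokenizer_4bc8884ad68b4297fd3c,CountVectorizer_411fbdeb4100c2bfe8ef, PCA_4c538d67e7b8f29ff8d0,LogisticRegression_4db49954edc7033edc76]
Each stage is obtained by simple list indexing:
best_lr = best_pipeline.stages[3]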
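Indexing by position ties the code to the exact layout of the pipeline. As a minimal sketch, a stage can instead be looked up by its class. The helper name `find_stage` and the `Fake*` classes are hypothetical stand-ins, introduced only so the snippet runs without a Spark session:

```python
def find_stage(pipeline_model, stage_type):
    """Return the first stage of pipeline_model that is an instance of stage_type."""
    for stage in pipeline_model.stages:
        if isinstance(stage, stage_type):
            return stage
    raise ValueError("no stage of type %s in the pipeline" % stage_type.__name__)


# Hypothetical stand-ins for a fitted PipelineModel and its stages.
class FakeTokenizer(object):
    pass


class FakeLogisticRegressionModel(object):
    pass


class FakePipelineModel(object):
    def __init__(self, stages):
        self.stages = stages


fake_pipeline = FakePipelineModel([FakeTokenizer(), FakeLogisticRegressionModel()])
best_lr_stage = find_stage(fake_pipeline, FakeLogisticRegressionModel)
print(type(best_lr_stage).__name__)
```

With Spark available, the same helper would presumably be called with the real best pipeline and `pyspark.ml.classification.LogisticRegressionModel` as the type.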
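Now the approach shown earlier (coefficients, intercept) can be applied to best_lr.
I ran into this problem as well. I found that you need to go through the Java object, though I don't know why. So just do this: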
from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder, CrossValidator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName="mae")
lr = LinearRegression()
grid = ParamGridBuilder().addGrid(lr.maxIter, [500]) \
.addGrid(lr.regParam, [0]) \
.addGrid(lr.elasticNetParam, [1]) \
.build()
lr_cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, \
evaluator=evaluator, numFolds=3)
lrModel = lr_cv.fit(your_training_set_here)
bestModel = lrModel.bestModel
Print out the parameters you want:
>>> print('Best Param (regParam): ', bestModel._java_obj.getRegParam())
0
>>> print('Best Param (MaxIter): ', bestModel._java_obj.getMaxIter())
500
>>> print('Best Param (elasticNetParam): ', bestModel._java_obj.getElasticNetParam())
1
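This also applies to other methods, such as extractParamMap(). They should fix this soon.
Assuming cvModel3Day is the name of your model, the parameters can be extracted in Spark Scala as follows: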
val params = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].extractParamMap()
val depth = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].getMaxDepth
val iter = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].getMaxIter
val bins = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].getMaxBins
val features = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].getFeaturesCol
val step = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].getStepSize
val samplingRate = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel].getSubsamplingRate
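There are actually two questions here:
- what are the aspects of the fitted model (such as coefficients and intercept)
- which meta parameters were used to fit bestModel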
%pyspark
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
logit = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[logit])
paramGrid = ParamGridBuilder() \
.addGrid(logit.regParam, [0, 0.01, 0.05, 0.1, 0.5, 1]) \
.addGrid(logit.elasticNetParam, [0.0, 0.1, 0.5, 0.8, 1]) \
.build()
evaluator = BinaryClassificationEvaluator(metricName = 'areaUnderPR')
crossval = CrossValidator(estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=evaluator,
numFolds=5)
tuned_model = crossval.fit(train)
model = tuned_model.bestModel
The meta parameters can then be read through the generic methods on the Java object, without explicitly referring to methods like getRegParam().
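A minimal sketch of that generic lookup is given here. It assumes the Java object offers `getParam(name)` and `getOrDefault(param)`, and that every row of the grid has the same keys. The helper name `tuned_param_values` and the `Fake*` stand-ins are hypothetical; they exist only so the snippet runs without a Spark session:

```python
from collections import namedtuple


def tuned_param_values(java_model, param_grid):
    """Look up the value of every tuned parameter on a JVM-backed model.

    java_model -- the object behind `_java_obj` (needs getParam / getOrDefault)
    param_grid -- the list of dicts returned by ParamGridBuilder().build()
    """
    # Only the first row is inspected, assuming every row uses the same keys.
    names = [param.name for param in param_grid[0]]
    return {name: java_model.getOrDefault(java_model.getParam(name))
            for name in names}


# Hypothetical stand-ins for pyspark's Param and for the py4j-backed model.
FakeParam = namedtuple("FakeParam", ["name"])


class FakeJavaModel(object):
    def __init__(self, values):
        self._values = values

    def getParam(self, name):
        return FakeParam(name)

    def getOrDefault(self, param):
        return self._values[param.name]


fake_grid = [{FakeParam("regParam"): 0.0, FakeParam("elasticNetParam"): 0.0},
             {FakeParam("regParam"): 0.01, FakeParam("elasticNetParam"): 0.0}]
fake_java_model = FakeJavaModel({"regParam": 0.01, "elasticNetParam": 0.0, "maxIter": 10})
tuned = tuned_param_values(fake_java_model, fake_grid)
print(tuned)
```

With the objects from the setup above, the call would presumably be `tuned_param_values(model.stages[-1]._java_obj, paramGrid)`.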
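This performs the following steps:
- Take the last stage of the best model: crossval.fit(..).bestModel.stages[-1]
- Take its internal Java object, _java_obj
- Get the names of all configured parameters from paramGrid (a list of dictionaries). Only the first row is used, on the assumption that it is an actual grid, that is, every row contains the same keys. Otherwise you need to collect all the names used in any row.
- Pass each Param instance to the function to get the actual value
It took a few minutes to decipher, but I figured it out: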
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# note: the pipeline (with its hashingTF and linearSVC stages) is already built; only the tuning is shown
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [1000]) \
    .addGrid(linearSVC.regParam, [0.1, 0.01]) \
    .addGrid(linearSVC.maxIter, [10, 20, 30]) \
    .build()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2)
cvModel = crossval.fit(train)
prediction = cvModel.transform(test)
bestModel = cvModel.bestModel
# applicable to your model: list all stages of the best pipeline
for stage in bestModel.stages:
    print(stage)
# get a stage's parameter by indexing the right Transformer, then calling .get<parameter>()
print(bestModel.stages[3].getNumFeatures())
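This may not be as good as wernerchao's answer (because it is not convenient for storing the hyperparameters in variables), but you can quickly look at the best hyperparameters of a cross-validation model this way: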
cvModel.getEstimatorParamMaps()[ np.argmax(cvModel.avgMetrics) ]
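Note that `np.argmax` only picks the right entry when a larger metric is better. For error metrics such as `mae`, the smallest average wins, and CrossValidator itself consults the evaluator's `isLargerBetter()` when choosing. A minimal sketch that handles both directions follows; the helper name `best_param_map` and the example values are illustrative assumptions, and plain dicts stand in for the Spark param maps:

```python
def best_param_map(param_maps, avg_metrics, larger_is_better=True):
    """Return the entry of param_maps whose averaged metric is best."""
    if len(param_maps) != len(avg_metrics):
        raise ValueError("param_maps and avg_metrics must have the same length")
    choose = max if larger_is_better else min
    best_index = choose(range(len(avg_metrics)), key=lambda i: avg_metrics[i])
    return param_maps[best_index]


# Illustrative values only: three grid points and their averaged metrics.
example_maps = [{"regParam": 0.1}, {"regParam": 0.01}, {"regParam": 0.001}]
example_metrics = [0.71, 0.84, 0.79]

best_for_auc = best_param_map(example_maps, example_metrics)
best_for_error = best_param_map(example_maps, example_metrics, larger_is_better=False)
print(best_for_auc, best_for_error)
```

With a fitted model this would presumably be called as `best_param_map(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics, evaluator.isLargerBetter())`.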
(2020-05-21)
I know this is an old question, but I found a way to do this. @Pierre Gourseaud gives us a nice way to get the hyperparameters of the best model:
hyperparams = model_cv.getEstimatorParamMaps()[np.argmax(model_cv.avgMetrics)]
print(hyperparams)
[(Param(parent='ALS_cd65d45ab31c', name='implicitPrefs', doc='whether to use implicit preference'),
True),
(Param(parent='ALS_cd65d45ab31c', name='nonnegative', doc='whether to use nonnegative constraint for least squares'),
True),
(Param(parent='ALS_cd65d45ab31c', name='coldStartStrategy', doc="strategy for dealing with unknown or new users/items at prediction time. This may be useful in cross-validation or production scenarios, for handling user/item ids the model has not seen in the training data. Supported values: 'nan', 'drop'."),
'drop'),
(Param(parent='ALS_cd65d45ab31c', name='rank', doc='rank of the factorization'),
28),
(Param(parent='ALS_cd65d45ab31c', name='maxIter', doc='max number of iterations (>= 0).'),
20),
(Param(parent='ALS_cd65d45ab31c', name='regParam', doc='regularization parameter (>= 0).'),
0.01),
(Param(parent='ALS_cd65d45ab31c', name='alpha', doc='alpha for implicit preference'),
20.0)]
But that output is not very readable, so you can do this:
import re
hyper_list = []
for i in range(len(hyperparams.items())):
    hyper_name = re.search("name='(.+?)'", str([x for x in hyperparams.items()][i])).group(1)
    hyper_value = [x for x in hyperparams.items()][i][1]
    hyper_list.append({hyper_name: hyper_value})
print(hyper_list)
[{'implicitPrefs': True}, {'nonnegative': True}, {'coldStartStrategy': 'drop'}, {'rank': 28}, {'maxIter': 20}, {'regParam': 0.01}, {'alpha': 20.0}]
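The same name-to-value mapping can probably be built without a regular expression, since each key of the param map is a `Param` object that carries a `name` attribute. A minimal sketch follows, using a hypothetical `FakeParam` stand-in and made-up parent ids so it runs without Spark:

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.ml.param.Param (parent, name, doc).
FakeParam = namedtuple("FakeParam", ["parent", "name", "doc"])

hyperparams_example = {
    FakeParam("ALS_x", "rank", "rank of the factorization"): 28,
    FakeParam("ALS_x", "maxIter", "max number of iterations (>= 0)."): 20,
    FakeParam("ALS_x", "regParam", "regularization parameter (>= 0)."): 0.01,
}

# One flat dict keyed by parameter name, instead of a list of one-entry dicts.
hyper_dict = {param.name: value for param, value in hyperparams_example.items()}
print(hyper_dict)
```

A single dict is a deliberate change from the list of one-entry dicts above: values can then be read directly, for example `hyper_dict["rank"]`.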
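In my case I had trained an ALS model, but it should work for your case too, because I also trained it with cross-validation.
Be sure to mark an answer (wernerchao's, below, worked for me).
I'll take your word for it, although this project is a distant memory for me now...
Nice catch. Even better than a fix would be a feature like cvModel.getAllTheBestModelsParametersPlease()
The answer did not work for me. The correct way is: modelnonly.bestModel.stages[-1]._java_obj.parent().getRegParam(). Or, if you are not using a pipeline, just drop the stages[-1].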