Machine learning: getting the probabilities of training-set instances from a PySpark ML model
I trained a decision tree as a binary classifier. My goal is to get the probability of each label (0, 1) for every instance, in both the training and the test set. I plan to use these probabilities to discretize the continuous values in the prediction column. In scikit-learn, the probabilities for both sets are available via predict_proba:
# Train set
tree_model.predict_proba(X_train.age.to_frame())
# Test set
tree_model.predict_proba(X_test.age.to_frame())
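For reference, a self-contained sketch of the scikit-learn call above. The data here is made up for illustration; a single `age` feature column stands in for `X_train.age.to_frame()`:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the age column (values are illustrative only).
X_train = np.array([[20.0], [25.0], [32.0], [40.0], [52.0], [60.0]])
y_train = np.array([1, 1, 0, 0, 1, 1])

tree_model = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)

# predict_proba returns one row per instance, one column per class:
# [P(label=0), P(label=1)], so each row sums to 1.
proba = tree_model.predict_proba(X_train)
```

The same call works unchanged on the test set, since `predict_proba` accepts any array with the training feature shape.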
But this does not seem to be the case with PySpark:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=5)
pipeline = Pipeline(stages=[dt])
model = pipeline.fit(train_data)
predictions = model.transform(test_data)
The probabilities for the test-set instances are written to the predictions DataFrame:
predictions.show(5,truncate=False)
+--------+-----+-------------+---------------------------------------+----------+
|features|label|rawPrediction|probability |prediction|
+--------+-----+-------------+---------------------------------------+----------+
|[0.0] |1.0 |[132.0,123.0]|[0.5176470588235295,0.4823529411764706]|0.0 |
|[0.0] |1.0 |[132.0,123.0]|[0.5176470588235295,0.4823529411764706]|0.0 |
|[0.0] |1.0 |[132.0,123.0]|[0.5176470588235295,0.4823529411764706]|0.0 |
|[0.0] |0.0 |[132.0,123.0]|[0.5176470588235295,0.4823529411764706]|0.0 |
|[32.0] |0.0 |[3.0,1.0] |[0.75,0.25] |0.0 |
+--------+-----+-------------+---------------------------------------+----------+
only showing top 5 rows
How can I get the probabilities for the instances in the training set?