Machine learning: Getting per-instance probabilities for the training set from a PySpark ML model

I trained a decision tree as a binary classifier, and my goal is to obtain the probability of each label (0, 1) for every instance, i.e. for both the training set and the test set. I plan to use these probabilities to discretize the continuous values in the prediction column.

In scikit-learn, the probabilities for both the training and the test set are available via predict_proba:

# Train set
tree_model.predict_proba(X_train.age.to_frame())
# Test set
tree_model.predict_proba(X_test.age.to_frame())
But this does not seem to be the case with PySpark:

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=5)
pipeline = Pipeline(stages=[dt])
model = pipeline.fit(train_data)
predictions = model.transform(test_data)
The probabilities for the test-set instances are written to the predictions DataFrame:

predictions.show(5,truncate=False)
+--------+-----+-------------+---------------------------------------+----------+
|features|label|rawPrediction|probability                            |prediction|
+--------+-----+-------------+---------------------------------------+----------+
|[0.0]   |1.0  |[132.0,123.0]|[0.5176470588235295,0.4823529411764706]|0.0       |
|[0.0]   |1.0  |[132.0,123.0]|[0.5176470588235295,0.4823529411764706]|0.0       |
|[0.0]   |1.0  |[132.0,123.0]|[0.5176470588235295,0.4823529411764706]|0.0       |
|[0.0]   |0.0  |[132.0,123.0]|[0.5176470588235295,0.4823529411764706]|0.0       |
|[32.0]  |0.0  |[3.0,1.0]    |[0.75,0.25]                            |0.0       |
+--------+-----+-------------+---------------------------------------+----------+
only showing top 5 rows
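
As an aside on the discretization step mentioned above, one way to pull the individual class probabilities out of the Vector-typed probability column is vector_to_array (assumes Spark 3.0+; the new column names below are just illustrative):

from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

# Expand the probability Vector into an array column and pick out the class-1 entry.
probs = (predictions
         .withColumn("prob_array", vector_to_array("probability"))
         .withColumn("prob_label_1", F.col("prob_array")[1]))
probs.select("features", "label", "prob_label_1", "prediction").show(5, truncate=False)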
How can I get the probabilities for the instances in the training set?
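
One approach (a minimal sketch, not from the original question, assuming the fitted model above) is to apply the same fitted pipeline model to the training DataFrame; transform simply scores whatever DataFrame it is given, so the resulting probability column then covers the training-set instances:

# Score the training set with the already-fitted pipeline model.
train_predictions = model.transform(train_data)
train_predictions.select("features", "label", "probability", "prediction").show(5, truncate=False)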