Additional columns in PySpark GBTClassifier output


python apache-spark pyspark


For gradient-boosted trees in PySpark, the inputs/parameters of the GBTClassifier object are defined as:

class pyspark.ml.classification.GBTClassifier(featuresCol='features', labelCol='label', predictionCol='prediction', maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, lossType='logistic', maxIter=20, stepSize=0.1, seed=None, subsamplingRate=1.0)

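When creating the model, there is no rawPredictionCol or probabilityCol among the parameters. It also has no getRawPredictionCol or getProbabilityCol methods. These methods and parameters are available for the random forest, decision tree, and logistic regression classifiers.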
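One way to compare classifiers on this point is to ask each estimator directly. The following is a minimal sketch, not part of the original question: the helper name output_column_support is hypothetical, and it relies only on Params.hasParam and ordinary attribute lookup. The result depends on the installed Spark version.

```python
def output_column_support(estimator,
                          names=("rawPredictionCol", "probabilityCol")):
    """Report, per parameter name, whether the estimator exposes the
    param itself and the matching getter (e.g. getRawPredictionCol)."""
    report = {}
    for name in names:
        getter = "get" + name[0].upper() + name[1:]
        report[name] = {
            # Params.hasParam checks for a param with this string name.
            "param": estimator.hasParam(name),
            # Plain attribute lookup for the getter method.
            "getter": hasattr(estimator, getter),
        }
    return report
```

Calling it on GBTClassifier() and on RandomForestClassifier() side by side should make the difference described above visible on an affected version.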

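Now, when I fit the model and apply a transform, I get two additional columns, rawPrediction and probability.

Here is the example I used, taken directly from the Spark documentation: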

from numpy import allclose
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StringIndexer

# `spark` is assumed to be an existing SparkSession (as in the pyspark shell).
df = spark.createDataFrame([
    (1.0, Vectors.dense(1.0)),
    (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
# Index the label column, then fit the GBT model on the indexed labels.
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(df)
td = si_model.transform(df)
gbt = GBTClassifier(maxIter=5, maxDepth=2, labelCol="indexed", seed=42)
model = gbt.fit(td)
model.featureImportances
allclose(model.treeWeights, [1.0, 0.1, 0.1, 0.1, 0.1])

# Score a single unlabeled row and show the resulting columns.
test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
model.transform(test0).show()

The last line of the example prints:

# +--------+--------------------+--------------------+----------+
# |features|       rawPrediction|         probability|prediction|
# +--------+--------------------+--------------------+----------+
# |  [-1.0]|[1.16967390812510...|[0.91208380260474...|       0.0|
# +--------+--------------------+--------------------+----------+
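To confirm which columns the transform contributes without reading the printed table, the column lists before and after can be compared. This is a small sketch; the helper name added_columns is hypothetical, and it only assumes that both arguments expose a columns list, as Spark DataFrames do.

```python
def added_columns(before, after):
    """Return the column names present in `after` but not in `before`,
    preserving the order in which they appear in `after`."""
    existing = set(before.columns)
    return [name for name in after.columns if name not in existing]
```

With the example above, added_columns(test0, model.transform(test0)) would be expected to list rawPrediction, probability and prediction, matching the header of the table shown.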

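I don't understand why these columns appear in my output. There is certainly no input parameter for them, because I tried passing rawPredictionCol=rawPredictionCol when instantiating the class, and it threw an error.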

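Is this a bug, or are these columns supposed to be in the output? If they are, how do I set their names when instantiating the class, and why do they not have their own get methods, such as getFeaturesCol?

This is not a bug. Returning the rawPrediction and probability columns is the expected behavior. It also looks like there are parameters for setting the names of these columns, but as of v2.2.1 that functionality does not seem to have been added to PySpark.

Is there a way to set the names of the output columns, using something like setRawPredictionCol()? Or is there a way to access the Scala object to set the names? I am not very familiar with Scala, so I do not know whether that is feasible.

My two cents: accessing the Scala object through PySpark would probably take more work. After calling transform, you can use the alias function to assign new column names.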
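The alias suggestion could be sketched roughly as follows. The helper name rename_output_columns and the new column names are hypothetical; the sketch assumes only that the transformed DataFrame supports df[name], Column.alias and select, which are standard PySpark DataFrame operations.

```python
def rename_output_columns(predictions, new_names):
    """Return `predictions` with selected columns renamed via alias.

    `new_names` maps existing column names to their replacements;
    columns that are not listed keep their current names.
    """
    selected = [
        predictions[name].alias(new_names.get(name, name))
        for name in predictions.columns
    ]
    return predictions.select(*selected)


def demo(model, test_df):
    """Hypothetical usage: `model` is the fitted GBT model from the question."""
    predictions = model.transform(test_df)
    renamed = rename_output_columns(
        predictions,
        {"rawPrediction": "gbt_raw", "probability": "gbt_prob"},
    )
    renamed.show()
```

For one or two columns, chaining withColumnRenamed on the transformed DataFrame would achieve the same result with less code.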