Apache Spark CrossValidator.fit() - IllegalArgumentException: Column prediction must be of type equal to ... [array&lt;double&gt;, array&lt;double&gt;] but was actually of type double
Here are the packages I'm using with Python 3.9 and Spark 3.1.1:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MultilabelClassificationEvaluator
I'm trying to feed a vectorized dataset df_vec into the CrossValidator function. The dataset consists of two columns with the default names:
- features: a vector from VectorAssembler
- label: a numeric label indexed from a string by StringIndexer
df_vec.printSchema()

root
 |-- features: vector (nullable = true)
 |-- label: integer (nullable = true)
I run the following steps to set up the CrossValidator:
mlr = LogisticRegression()
mlr_evaluator = MultilabelClassificationEvaluator()

paramGrid = ParamGridBuilder() \
    .addGrid(mlr.maxIter, [200]) \
    .build()

cross_validator = CrossValidator(
    estimator=mlr,
    estimatorParamMaps=paramGrid,
    evaluator=mlr_evaluator
)
Trying to fit the CrossValidator object on df_vec raises an exception:
cv_model = cross_validator.fit(df_vec)

pyspark.sql.utils.IllegalArgumentException: requirement failed:
Column prediction must be of type equal to one of the following types:
[array&lt;double&gt;, array&lt;double&gt;] but was actually of type double.
So it seems CrossValidator expects a different format for some reason. If CrossValidator() were fine-grained enough, I could try converting the prediction column into a vector with VectorAssembler, but it is not. Does anyone know how to solve this?

You can use MulticlassClassificationEvaluator instead. You only have a single integer label per row, so using a multilabel evaluator does not make sense here:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

mlr = LogisticRegression()
mlr_evaluator = MulticlassClassificationEvaluator()

paramGrid = ParamGridBuilder() \
    .addGrid(mlr.maxIter, [200]) \
    .build()

cross_validator = CrossValidator(
    estimator=mlr,
    estimatorParamMaps=paramGrid,
    evaluator=mlr_evaluator
)
The MultilabelClassificationEvaluator requires the prediction column to be an array&lt;double&gt;, which is exactly what the error says:

pyspark.sql.utils.IllegalArgumentException: requirement failed:
Column prediction must be of type equal to one of the following types:
[array&lt;double&gt;, array&lt;double&gt;] but was actually of type double.
Fitting the LogisticRegression on its own and inspecting the output schema confirms that prediction is a plain double, not an array:

x = mlr.fit(df_vec).transform(df_vec)
x.printSchema()
root
 |-- features: vector (nullable = true)
 |-- label: integer (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)   <---