Apache Spark CrossValidator.fit() - IllegalArgumentException: Column prediction must be of type equal to one of the following types: [array<double>, array<double>] but was actually of type double


These are the packages I am using with Python 3.9 and Spark 3.1.1:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MultilabelClassificationEvaluator
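I am trying to pass a vectorized dataset, df_vec, to CrossValidator. The dataset consists of two columns with the default names:

  • features
    - a vector produced by VectorAssembler
  • label
    - numeric strings indexed by StringIndexer
This is a multinomial logistic regression problem with 6 labels.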

df_vec.printSchema()

I run the following steps to set up the CrossValidator:

mlr = LogisticRegression()
mlr_evaluator = MultilabelClassificationEvaluator()
paramGrid = ParamGridBuilder() \
    .addGrid(mlr.maxIter, [200]) \
    .build()

cross_validator = CrossValidator(
    estimator=mlr,
    estimatorParamMaps=paramGrid,
    evaluator=mlr_evaluator
)
Trying to fit the CrossValidator object to df_vec raises an exception:

cv_model = cross_validator.fit(df_vec)

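So CrossValidator seems to expect a different format for some reason. If CrossValidator() were more granular, I could try converting the prediction column to a vector with VectorAssembler, but that is not the case.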


Does anyone know how to solve this?

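You can use MulticlassClassificationEvaluator instead. Each row has a single integer label, so the multilabel evaluator does not make sense here: MultilabelClassificationEvaluator expects the prediction column to be of type array<double>, while LogisticRegression emits a single double, which is exactly what the exception reports.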

The exception raised by cross_validator.fit(df_vec) in the question:

pyspark.sql.utils.IllegalArgumentException: requirement failed:
Column prediction must be of type equal to one of the following types:
[array<double>, array<double>] but was actually of type double.

The schema produced by fitting the model directly, where prediction is a double:

x = mlr.fit(df_vec).transform(df_vec)
x.printSchema()

root
 |-- features: vector (nullable = true)
 |-- label: integer (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false) <---