Apache Spark: unable to train a model using XGBoost with PySpark

apache-spark machine-learning pyspark

I am trying to train an XGBoost model on a Spark DataFrame that looks like this:

+--------------------+-------------------+
|            features|         TARGET_VAL|
+--------------------+-------------------+
|(122,[0,1,9,10,11...|                0.0|
|(122,[0,1,8,9,11,...| 14.577420000000002|
|[4.0,1.0,0.0,0.0,...|           65.44524|
|(122,[0,1,8,9,11,...|                0.0|
|(122,[0,1,8,9,10,...|           18.27017|
|(122,[0,1,8,11,12...|                0.0|
|(122,[0,1,8,10,11...|           75.75954|
|(122,[0,1,10,11,1...|           65.32013|
|[1.0,0.0,1.0,0.0,...|          171.16563|
|(122,[0,1,8,11,12...|                0.0|
|(122,[0,1,8,9,11,...|                0.0|
|(122,[0,1,8,10,11...|            2.27041|
|(122,[0,1,11,12,2...|                0.0|
|[4.0,1.0,0.0,0.0,...|           76.08024|
|(122,[0,1,8,9,11,...|                0.0|
|(122,[0,1,8,10,11...|           15.31895|
|(122,[0,1,8,10,11...|          122.56702|
|(122,[0,1,8,10,11...|-30.268179999999997|
|(122,[0,1,8,10,11...|                0.0|
|(122,[0,1,10,11,4...|          136.80025|
+--------------------+-------------------+

I am using sparkxgb (XGBoost with PySpark), and I set up the model as follows:

paramMap = {'eta': 0.1, 'subsample': 0.8}

xgbClassifier = XGBoostClassifier(**paramMap) \
    .setFeaturesCol("features") \
    .setLabelCol("TARGET_VAL")

When I train the model with:

xgboostModel = xgbClassifier.fit(df)

I get the following error:

java.lang.IllegalArgumentException: requirement failed: Classifier found max label value = 23470.00821 but requires integers in range [0, ... 2147483647)

So I cast the TARGET_VAL column to int, and then got the following error instead:

java.lang.IllegalArgumentException: requirement failed: Classifier inferred 23471 from label values in column XGBoostClassifier_37d67e9f2233__labelCol, but this exceeded the max numClasses (100) allowed to be inferred from values.  To avoid this error for labels with > 100 classes, specify numClasses explicitly in the metadata; this can be done by applying StringIndexer to the label column.

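I am new to XGBoost and machine learning. My understanding is that TARGET_VAL is the column the trained model will predict for the test dataset, and that it should be a float value. So what am I doing wrong? Do I need to configure the model with different parameters?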

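The problem here is that TARGET_VAL is a continuous variable column, whereas XGBoostClassifier requires a discrete/categorical label column. A classifier treats every distinct label value as a separate class, so there are far too many classes for it. As you can see in the error, the maximum numClasses that can be inferred is 100, and I am sure you have more than 100 distinct values.

You are using a classification algorithm for a regression problem.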
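
As a rough illustration of the two checks quoted in the error messages, here is a minimal pure-Python sketch (not Spark's actual implementation). The helper names diagnose_labels and make_regressor are hypothetical. make_regressor assumes that the sparkxgb wrapper exposes an XGBoostRegressor with the same setters as XGBoostClassifier; verify this against the version you have installed.

```python
MAX_INFERRED_CLASSES = 100  # limit quoted in the second error message


def diagnose_labels(labels, max_classes=MAX_INFERRED_CLASSES):
    """Return None if labels look like class indices, else a reason string."""
    # Check 1: every label must be a non-negative integer value.
    for value in labels:
        if value < 0 or not float(value).is_integer():
            return f"label {value} is not a non-negative integer"
    # Check 2: the class count is inferred as max label + 1.
    inferred = int(max(labels)) + 1
    if inferred > max_classes:
        return f"inferred {inferred} classes, more than {max_classes}"
    return None


def make_regressor(param_map):
    """Build a regressor for a continuous target (needs Spark and sparkxgb)."""
    # Assumption: sparkxgb provides XGBoostRegressor; imported lazily so the
    # label check above can be used without Spark installed.
    from sparkxgb import XGBoostRegressor

    return (
        XGBoostRegressor(**param_map)
        .setFeaturesCol("features")
        .setLabelCol("TARGET_VAL")
    )


if __name__ == "__main__":
    print(diagnose_labels([0.0, 65.44524, 171.16563]))  # fractional labels
    print(diagnose_labels([0, 65, 23470]))              # after casting to int
    print(diagnose_labels([0, 1, 2]))                   # valid class indices
```

The first call fails the integer check, which mirrors the first error. The second call passes it but infers 23471 classes, which mirrors the error seen after casting to int. Neither problem goes away by changing eta or subsample; for a float target, a regressor such as the one make_regressor({'eta': 0.1, 'subsample': 0.8}) would build is the appropriate estimator.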