Scala Spark ML pipeline throws an exception for random forest classification: Column label must be of type DoubleType but was actually IntegerType

scala apache-spark

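I am trying to build a Spark ML pipeline with a random forest classifier to perform classification (not regression), but I get an error saying that the label being predicted in my training set should be a double rather than an integer. I followed the instructions on these pages:

(apache.org)
(stackoverflow.com)
(sparktutorials.net)

I have a Spark DataFrame with the following columns: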

scala> df.show(5)
+-------+----------+----------+---------+-----+
| userId|duration60|duration30|duration1|label|
+-------+----------+----------+---------+-----+
|user000|        11|        21|       35|    3|
|user001|        28|        41|       28|    4|
|user002|        17|         6|        8|    2|
|user003|        39|        29|        0|    1|
|user004|        26|        23|       25|    3|
+-------+----------+----------+---------+-----+


scala> df.printSchema()
root
 |-- userId: string (nullable = true)
 |-- duration60: integer (nullable = true)
 |-- duration30: integer (nullable = true)
 |-- duration1: integer (nullable = true)
 |-- label: integer (nullable = true)

I use the feature columns duration60, duration30 and duration1 to predict the categorical column label.

I then set up my Spark script as follows:

import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.{Pipeline, PipelineModel}


Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)

val sqlContext = new SQLContext(sc)
val df = sqlContext.read.
    format("com.databricks.spark.csv").
    option("header", "true"). // Use first line of all files as header
    option("inferSchema", "true"). // Automatically infer data types
    load("/tmp/features.csv").
    withColumnRenamed("satisfaction", "label").
    select("userId", "duration60", "duration30", "duration1", "label")

val assembler = new VectorAssembler().
    setInputCols(Array("duration60", "duration30", "duration1")).
    setOutputCol("features")


val randomForest = new RandomForestClassifier().
    setLabelCol("label").
    setFeaturesCol("features").
    setNumTrees(10)

val pipeline = new Pipeline().setStages(Array(assembler, randomForest))

val model = pipeline.fit(df)

The transformed DataFrame looks like this:

scala> assembler.transform(df).show(5)
+-------+----------+----------+---------+-----+----------------+
| userId|duration60|duration30|duration1|label|        features|
+-------+----------+----------+---------+-----+----------------+
|user000|        11|        21|       35|    3|[11.0,21.0,35.0]|
|user001|        28|        41|       28|    4|[28.0,41.0,28.0]|
|user002|        17|         6|        8|    2|  [17.0,6.0,8.0]|
|user003|        39|        29|        0|    1| [39.0,29.0,0.0]|
|user004|        26|        23|       25|    3|[26.0,23.0,25.0]|
+-------+----------+----------+---------+-----+----------------+

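However, the last line (pipeline.fit) throws an exception:

java.lang.IllegalArgumentException: requirement failed: Column label must be of type DoubleType but was actually IntegerType.

What does this mean, and how do I fix it?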

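Why does the label column need to be a double? I am predicting a class, not doing regression, so I would expect a string or an integer to be appropriate. A double value for the predicted column usually implies regression.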

Cast the label column to DoubleType, because that is the type the algorithm expects:
import org.apache.spark.sql.types._
df.withColumn("label", 'label cast DoubleType)

So, in your application, perform the cast as the last step of the chain that defines val df:
import org.apache.spark.sql.types._
val df = sqlContext.read.
    format("com.databricks.spark.csv").
    option("header", "true"). // Use first line of all files as header
    option("inferSchema", "true"). // Automatically infer data types
    load("/tmp/features.csv").
    withColumnRenamed("satisfaction", "label").
    select("userId", "duration60", "duration30", "duration1", "label").
    withColumn("label", 'label cast DoubleType) // <-- HERE: cast the label to Double
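To check that the cast has the intended effect, the schema can be inspected before and after. The following is a minimal sketch, not the original poster's setup: it assumes Spark 2.x or later with a local SparkSession (the question uses the older SQLContext), builds the DataFrame in memory from a few rows shown in the question instead of reading /tmp/features.csv, and uses col("label") rather than the symbol syntax. The object name CastLabelExample is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, IntegerType}

object CastLabelExample {
  def main(args: Array[String]): Unit = {
    // Assumption: Spark 2.x+ running locally; the question itself uses SQLContext.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cast-label")
      .getOrCreate()
    import spark.implicits._

    // A few rows copied from the df.show(5) output in the question.
    val df = Seq(
      ("user000", 11, 21, 35, 3),
      ("user001", 28, 41, 28, 4),
      ("user002", 17, 6, 8, 2)
    ).toDF("userId", "duration60", "duration30", "duration1", "label")

    // As with inferSchema, the label starts out as an integer column.
    assert(df.schema("label").dataType == IntegerType)

    // Replace the label column with a Double version of itself.
    val casted = df.withColumn("label", col("label").cast(DoubleType))
    assert(casted.schema("label").dataType == DoubleType)

    casted.printSchema()
    spark.stop()
  }
}
```

The cast only changes the column type. It does not attach the class metadata that the comments further down refer to.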

In PySpark, the equivalent cast is:
from pyspark.sql.types import DoubleType
df = df.withColumn("label", df.label.cast(DoubleType()))

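If you are using PySpark and run into the same problem, you can use StringIndexer:
from pyspark.ml.feature import StringIndexer
stringIndexer = StringIndexer(inputCol="label", outputCol="newlabel")
model = stringIndexer.fit(df)
df = model.transform(df)
df.printSchema()

This is one way to obtain a label column of type double: StringIndexer writes the indexed labels to the new column newlabel as doubles, so the classifier's label column must then be set to newlabel.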
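Thanks. Now I get an error: RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.

What does this mean? If I am doing classification, why should the label be a double? Usually a continuous dependent variable is used for regression, not classification.

@stackoverflowuser2010 A manually cast column lacks the metadata that ML estimators need in order to work. You have to add it manually.

I downvoted because your answer is misleading ("the other approaches don't seem to work"). The other answers also work when you have numeric labels!

Sorry, my mistake; I have edited it accordingly. Thanks for pointing that out.

This is still inaccurate. You might want to point out that if you have non-numeric labels, StringIndexer converts them into the required format, which a cast does not.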
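
The StringIndexer suggestion from the comments can also be expressed in Scala as an extra pipeline stage, so that the classifier receives a double label column that carries class metadata. This is a minimal sketch under stated assumptions, not the original poster's final code: it assumes Spark 2.x or later with a local SparkSession, uses the five rows shown in the question as in-memory data, and the names IndexedLabelPipeline, labelIndexer and indexedLabel are illustrative.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DoubleType

object IndexedLabelPipeline {
  def main(args: Array[String]): Unit = {
    // Assumption: Spark 2.x+ running locally; the question itself uses SQLContext.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("indexed-label")
      .getOrCreate()
    import spark.implicits._

    // The five rows shown in the question, with an integer label.
    val df = Seq(
      ("user000", 11, 21, 35, 3),
      ("user001", 28, 41, 28, 4),
      ("user002", 17, 6, 8, 2),
      ("user003", 39, 29, 0, 1),
      ("user004", 26, 23, 25, 3)
    ).toDF("userId", "duration60", "duration30", "duration1", "label")

    // Index the raw labels into doubles 0.0, 1.0, ... and attach class metadata.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")

    val assembler = new VectorAssembler()
      .setInputCols(Array("duration60", "duration30", "duration1"))
      .setOutputCol("features")

    // Train on the indexed label instead of the raw integer column.
    val randomForest = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("features")
      .setNumTrees(10)

    val pipeline = new Pipeline().setStages(Array(labelIndexer, assembler, randomForest))
    val model = pipeline.fit(df)

    val predictions = model.transform(df)
    assert(predictions.schema("indexedLabel").dataType == DoubleType)
    predictions.select("userId", "label", "indexedLabel", "prediction").show()

    spark.stop()
  }
}
```

Note that the prediction column is in the indexed label space, not the original 1 to 4 values. Spark's IndexToString transformer can map predictions back to the original labels if that is needed.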