Scala java.lang.String cannot be cast to java.lang.Double error when converting a Double-type dataframe to LabeledPoint in Spark

Tags: scala, apache-spark, null, type-conversion, spark-dataframe

I have a dataset with 2,002 variables, all of them numeric. I first read the dataset into Spark 1.5.0 and created a Double-type dataframe by programmatically specifying the schema. I then followed the documented steps to convert the dataframe to a LabeledPoint. However, when I try to print sample rows from the resulting LabeledPoint, I get the error

"java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double"

Could anyone tell me where the error comes from and how to fix it? Sorry the code below is long, but I hope it helps with the debugging. Thank you very much for your help.
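For context, the exception itself is easy to reproduce: Row.getDouble simply casts whatever object sits at the given index, so a String stored in a Row fails in exactly this way. A minimal standalone sketch (my own illustration, independent of the dataset):

import org.apache.spark.sql.Row

val r = Row("1.23")     // the value is stored as a String, not a Double
// r.getDouble(0)       // java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double
Row(1.23).getDouble(0)  // fine: the stored value really is a Double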

Here is the Scala code I used:

// Read in the dataset but drop the header row
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val trainRDD = sc.textFile("train.txt").filter(line => !line.contains("target"))
// Read in the header file to get the column names. Store them in an Array.
import scala.io.Source
val dictFile = "header.txt"
var arrName = new Array[String](2002)
// header.txt is expected to hold a single tab-delimited line of 2,002 column names
for (line <- Source.fromFile(dictFile).getLines) {
    arrName = line.split('\t').map(_.trim)
}

// Create the dataframe by programmatically specifying the schema
// Encode the schema as a space-delimited string of column names
val schemaString = arrName.mkString(" ")
// Import Row
import org.apache.spark.sql.Row
// Import RDD
import org.apache.spark.rdd.RDD
// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType,LongType,FloatType,DoubleType}
// Generate an all-DoubleType schema from the schema string
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, DoubleType, true)))
// Create the rowRDD, converting each String field to Double
// Broadcast the column indices (0 to 2001) used to order the fields
val arrVar = sc.broadcast((0 to 2001).toArray)
def createRowRDD(rdd: RDD[String], anArray: org.apache.spark.broadcast.Broadcast[Array[Int]]): org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
    // Split each line on tabs, parse every field as Double, then order the fields by the broadcast indices
    rdd.map(_.split("\t")).map(_.map(y => y.toDouble)).map(p => Row.fromSeq(anArray.value map p))
}
val rowRDDTrain = createRowRDD(trainRDD, arrVar)
// Apply the schema to the RDD.
val trainDF = sqlContext.createDataFrame(rowRDDTrain, schema)
trainDF.printSchema
// Verified all 2002 variables are in "double (nullable = true)" format
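// A small standalone sketch (hypothetical data, not part of the pipeline above):
// printSchema only echoes the declared schema, and createDataFrame(rowRDD, schema)
// validates nothing until an action runs, so a stray String in a Row passes this check.
val badRows = sc.parallelize(Seq(Row(1.0), Row("oops")))
val badDF = sqlContext.createDataFrame(badRows, StructType(Array(StructField("x", DoubleType, true))))
badDF.printSchema()  // happily reports x: double (nullable = true)
// badDF.rdd.map(_.getDouble(0)).collect()  // the ClassCastException only surfaces here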

// Define toLabeledPoint() to convert the dataframe to LabeledPoint format
// Reference: https://stackoverflow.com/questions/31638770/rdd-to-labeledpoint-conversion
def toLabeledPoint(dataDF: org.apache.spark.sql.DataFrame): org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = {
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    val targetInd = dataDF.columns.indexOf("target")
    val ignored = List("target")
    // Indices of all feature columns (every column except the target)
    val featInd = dataDF.columns.diff(ignored).map(dataDF.columns.indexOf(_))
    dataDF.rdd.map(r => LabeledPoint(r.getDouble(targetInd),
        Vectors.dense(featInd.map(r.getDouble(_)).toArray)))
}

// Create LabeledPoint from dataframe
val trainLP = toLabeledPoint(trainDF)
// Print out sample rows of the generated LabeledPoint
trainLP.take(5).foreach(println)
// Failed: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double
Update: imputing the null values as 0.0 with the version below made the problem go away:

def createRowRDD(rdd: RDD[String], anArray: org.apache.spark.broadcast.Broadcast[Array[Int]]): org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
    // Fields that fail to parse (null, empty, or non-numeric strings) are imputed as 0.0
    rdd.map(_.split("\t"))
       .map(_.map(y => try { y.toDouble } catch { case _: Throwable => 0.0 }))
       .map(p => Row.fromSeq(anArray.value map p))
}
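As a side note, the same imputation can be written a little more tightly with scala.util.Try; this is just an equivalent sketch of the fix above, not a change in behavior:

import scala.util.Try
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

def createRowRDD(rdd: RDD[String], anArray: org.apache.spark.broadcast.Broadcast[Array[Int]]): RDD[Row] = {
    // Try(...).getOrElse(0.0) imputes any unparseable field (empty, "null", etc.) as 0.0
    rdd.map(_.split("\t"))
       .map(_.map(y => Try(y.toDouble).getOrElse(0.0)))
       .map(p => Row.fromSeq(anArray.value map p))
}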

Could you reduce this to a minimal example? Do you have null values in your data? If a null string gets coerced to a numeric type, that can cause problems.

Thanks David and zero323! David was right: the problem was caused by null values in the data. I have updated my original post with a solution that imputes the null values as 0.0. Thanks to your great help, this was solved quickly!
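For reference, a quick standalone illustration (plain Scala, no Spark needed) of why empty or literal "null" fields break the original y.toDouble call:

// Fields produced by split("\t") that are empty or hold the text "null"
// cannot be parsed as numbers:
"".toDouble      // java.lang.NumberFormatException: empty String
"null".toDouble  // java.lang.NumberFormatException: For input string: "null"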